[57] ABSTRACT 

A network protocol and interface using direct deposit mes- 
saging provides low overhead communication in a network 
of multi-user computers. This system uses both sender- 
provided and receiver-provided information to process 
received messages and to deposit data directly in memory 
and to conditionally interrupt a host processor based on 
control information. Message processing is separated into 
data delivery, which bypasses the host processor and oper- 
ating system, and message actions which may or may not 
require host processor interaction. In this protocol, a mes- 
sage includes an indication of the operation desired by the 
sender, an operand specified by the sender and an operand 
which refers to some information stored at the receiver. The 
receiver ensures that the desired action is permitted and then, 
if the action is permitted, performs the action according to 
both the operand specified by the sender and the state of the 
receiver. The action may be message delivery, wherein the 
operands in the message specify values for use in various 
addressing modes including direct, indirect, post-increment 
and index modes. The action may also be conditionally 
generating an interrupt, wherein the operands are used, in 
combination with the receiver state, to determine whether a 
message requires immediate or delayed action. The action 
may also be an operation on a register in the network 
interface or on other information stored at the receiver. 
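
The receiver-side processing summarized above can be illustrated with a minimal sketch. It is not the claimed implementation: the class names, field names, the string encoding of the addressing modes, and the rule that an interrupt is raised only when the sender requests it and the receiver currently allows it are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Message:
    # All field names are hypothetical; the text only says a message carries an
    # operation, a sender-specified operand and an operand naming receiver state.
    op: str                 # operation desired by the sender
    mode: str               # "direct", "indirect", "post_increment" or "index"
    sender_operand: int     # value supplied by the sender (address or offset)
    receiver_operand: int   # index of an address register held at the receiver
    data: bytes = b""
    wants_interrupt: bool = False


class Receiver:
    def __init__(self, mem_size, num_regs, allowed_ops):
        self.memory = bytearray(mem_size)
        self.regs = [0] * num_regs          # address registers (receiver state)
        self.allowed_ops = set(allowed_ops)
        self.interrupts_enabled = True      # receiver-side part of the decision
        self.interrupts = 0                 # immediate actions
        self.delayed = 0                    # actions left for a convenient time

    def deliver(self, msg):
        """Check permission, compute the address, deposit data, then decide
        between an immediate interrupt and a delayed action."""
        if msg.op not in self.allowed_ops:
            return False
        reg = msg.receiver_operand
        if msg.mode == "direct":
            addr = msg.sender_operand
        elif not 0 <= reg < len(self.regs):
            return False
        elif msg.mode in ("indirect", "post_increment"):
            addr = self.regs[reg]
        elif msg.mode == "index":
            addr = self.regs[reg] + msg.sender_operand
        else:
            return False
        end = addr + len(msg.data)
        if addr < 0 or end > len(self.memory):
            return False
        self.memory[addr:end] = msg.data    # deposit without host involvement
        if msg.mode == "post_increment":
            self.regs[reg] = end            # advance the register past the data
        if msg.wants_interrupt and self.interrupts_enabled:
            self.interrupts += 1
        else:
            self.delayed += 1
        return True
```

In this sketch the post-increment mode lets several senders append to one queue without knowing where its end currently is, because the end of the queue is held in a register at the receiver.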

17 Claims, 12 Drawing Sheets 
COMPUTER NETWORK INTERFACE AND NETWORK PROTOCOL WITH DIRECT
DEPOSIT MESSAGING

This application is a continuation of application Ser. No.
08/226,541 filed Apr. 12, 1994, now abandoned.

FIELD OF THE INVENTION

This invention relates to computer network interfaces and
protocols and more particularly to such interfaces and protocols
for low overhead communication. The invention is particularly
applicable to asynchronous transfer mode (ATM) networks and
local-area networks (LANs).

BACKGROUND OF THE INVENTION

A communication system is a significant part of any modern
computer system. A fundamental characteristic of any such
communication system is the communication overhead. Such
overhead determines the kinds of applications that can be
exploited efficiently. Low overhead communication (low latency
and low impact on a host as defined below) is particularly
important in parallel, distributed, or real-time computing
systems.

In general, two fundamental properties in communication systems
contribute to overhead. The first is data delivery and the second
is message action. Specifically, data delivery requires
addressing at the receiver and message action requires interrupts
to invoke message action at the receiver. Thus, communication
involves both data delivery, transferring data from a sender to a
receiver, and message action, invoking some special action, such
as synchronization, on arrival of data at the receiver. Message
action is often requested by interrupting the processor at the
receiver.

Communication systems can be classified according to the division
of burden between sender and receiver. In a receiver-based
system, information controlling data delivery and message action
is localized to the receiver. In contrast, in a sender-based
system, the sender plays a more direct role by specifying
information in each message to control data delivery and message
action. The key information for data delivery is the destination
address of the data. In systems using receiver-based addressing,
the source has no direct control over the final address of a
message: a message identifies a buffer at the receiver into which
the message is stored at some implicit location, e.g. by
sequencing a pointer. By contrast, in systems using sender-based
addressing, the source specifies an address, contained in each
message, indicating directly where the message is to be stored at
the receiver. Receiver-based addressing involves significant
overhead in comparison to sender-based addressing. However,
sender-based addressing raises protection issues.

The key information for message action is whether to generate an
interrupt on message arrival. In systems using receiver-based
interrupts, an interrupt is generated by the receiver on the
arrival of every message. In systems using sender-based
interrupts, the sender specifies, by information contained in
each message, whether or not an interrupt should be generated on
the arrival of that message.

Most conventional communication systems are receiver-based, such
as the typical ethernet network using the Internet protocol. Most
open, public local area networks use this kind of protocol as
well.

Several systems use purely sender-based addressing. For example,
there is a system called "Hamlyn," of which an implementation in
hardware is discussed in "Hamlyn: An Interface for Sender-Based
Communication," by John Wilkes in Technical Report HPL-OSR-92-13,
Hewlett-Packard Laboratories, November 1992. One difficulty with
the published work on Hamlyn is that it merely provides a high
level design overview, with few implementation specifics. This
system includes sufficient protection mechanisms for a multi-user
LAN environment but the published work does not specialize it to
any network, only an unspecified "private multicomputer
interconnect," similar to others for parallel machines.

Another system similar to Hamlyn is described in "Efficient
Support for Multicomputing on ATM Networks" by C. Thekkath et
al., Technical Report TR93-04-03, Dept. of Computer Science and
Engineering, Univ. of Washington, Seattle, Wash., Apr. 12, 1993.
This system is a software-based emulation of purely sender-based
addressing specialized to ATM LANs, designed for distributed
system applications. [illegible] support for [illegible]-based
addressing is not addressed by Thekkath et al.

One problem with systems which use purely sender-based addressing
is that in many cases the location at the receiver where data is
to be placed may be dependent on the state of the receiver. For
example, if incoming messages are to be queued at the receiver,
the location of the end of the queue is [illegible] receiver. To
handle queueing in a [illegible] addressing system, the sender
must know the state of the receiver. Therefore, the sender must
either keep track of this location, which becomes difficult when
there is more than one sender [illegible] to the receiver, or use
additional messages to determine the state of the receiver, which
[illegible] atomic [illegible] of these options increases both
latency and impact on the host processor. [illegible] such
extensive knowledge of the receiver by the sender raises
protection problems. Similarly, sender-based interrupts also have
problems. The sender-based nature limits the possible interrupt
operations. For example, if the interrupt status is only a
function of the state maintained by the sender, it becomes
difficult to priority schedule interrupts at the receiver.

To overcome some of the communication overhead problems with
either purely sender-based or purely receiver-based
communication, many new parallel machines use variations of both
sender-based and receiver-based addressing. For example, the
Meiko [illegible] of Waltham, Mass., [illegible] send/receive
communication using receiver-based addressing and a remote
read/write model using sender-based addressing. To support bulk
data [illegible] a co-processor for demultiplexing and DMA to
memory. This DMA usually [illegible] scatter/gather, though
typically only with constant [illegible] sender-based
[illegible] addressing modes are combined mutually exclusively.
This mutual exclusive combination is also true of other new
parallel machines which use sender-based and receiver-based
addressing. In the MIT Alewife machine and the Stanford FLASH
machine, cache blocks for shared memory traffic use sender-based
addressing, while bulk data transfers and send-receive traffic
use receiver-based addressing. Generally, sender-based addressing
is used for random access communication and receiver-based
addressing is used for protected, cross-domain communication.

Some similar work which combines both sender-based and
receiver-based addressing is a system which is described in
"Active Messages: A Mechanism for Integrated Communication and
Computation" in Int'l Symposium of Computer Architecture, pp.
256-266, May 1992, by T. von Eicken et
al. In that system, the sender attaches to each message an
address in the receiver of an interrupt handler which is invoked
upon message delivery to extract the message from the network and
deposit or process the message as desired. The message may also
include other arguments. The interrupt handlers are constrained
in length and action and are executed using the host processor in
the receiver address space so that they run without the cost of a
context switch to a new thread. A major problem with this system
is that it is restricted, at least without hardware support, to
single user applications, because context switching is otherwise
required. An additional problem with this system is that it also
treats all messages at the receiver as requiring an interrupt and
thus does not reduce the impact on the processor.

By way of further background, William Dally presents in Dally, W.
J. et al., "Architecture of a Message-Driven Processor", in
International Symposium on Computer Architecture, 1987; Dally, W.
J. et al., "The Message Driven Processor: An Integrated
Multicomputer Processing Element", Proceedings of the 1992 IEEE
International Conference On Computer Design: VLSI In Computers &
Processors, Cambridge, Mass., Oct. 11-14, 1992; and Dally, W. J.
et al., "Fine-Grain Concurrent Computing", in Research Directions
in Computer Science: An MIT Perspective, edited by Albert Meyer,
MIT Press, 1991, a message driven processor system, in which a
non-conventional host processor executes communication actions
using built-in communication primitives without benefit of a
separate network interface. The primary disadvantage of this
system is this lack of a separate network interface which would
otherwise permit the use of conventional processors. Protection
issues, especially for user-to-user level communication, for a
multi-user system are also not addressed. Additionally, this
system is not well-adapted to ATM networks because the message
format is incompatible.

In a multi-user network environment typified by a local-area
network (LAN), communication is a global resource for which
protection must be provided to isolate a user on one processor
from accidental or malicious interference from another user on
another processor. Also, if the nodes are multi-user as well,
protection must also be provided to similarly isolate users from
each other on the same processor. [illegible] low overhead in
this type of multi-user network and multi-user processor
environment.

SUMMARY OF THE INVENTION

To overcome these problems and limitations with the prior art,
this invention provides a network protocol and interface for low
overhead communication in a network of multiuser computers based
on direct deposit messaging. Direct deposit messaging signifies
directly demultiplexing messages and depositing both data and
control information directly where they are needed, e.g., data in
memory, and control in (conditionally) interrupting the host
processor. In such a system, asynchronous events are controlled
by separating events by their need for the host processor.
Events, such as data delivery, which do not require the host
processor are handled directly, e.g., by depositing data directly
into user memory. Events, chiefly synchronization, that require
interaction with the host processor are divided into immediate
actions that require immediate service and delayable actions that
are accumulated and processed at some time convenient for the
host processor, thereby turning them into synchronous events.
With this separation of data and control events, resources
sufficient to bypass the non-host processor events rather than
all events, is all that is needed.

To obtain effective separation of these events, the receiver
processes a message using information in the message from the
sender. The receiver action in response to a message depends both
on this sender information and information stored at the
receiver. A message therefore includes an indication of the
operation desired by the sender, one or more operands specified
by the sender and one or more operands which refers to some state
maintained by the receiver. The receiver ensures that the desired
action is permitted and then, if the action is permitted,
performs the action according to the operand specified by the
sender and the state maintained by the receiver.

The action performed by the receiver may be message delivery,
wherein an operand in the message specifies values for use in
various addressing modes, such as direct, indirect,
post-increment and index modes. Data is either written or read
from such addresses directly, without host processor or operating
system intervention while maintaining multiuser protection. The
action may also be conditionally generating an interrupt, wherein
an operand is used in combination with the receiver state to
determine whether a message requires immediate or delayed action.
The action may involve special memory locations, called address
registers, contained in the network interface. This network
interface is especially useful for asynchronous transfer mode
(ATM) networks in which messages are composed of fixed size
primitive data units called cells.

The interface design also supports a multi-cell format in which a
first cell in a stream of cells is a control cell and subsequent
cells contain purely data. This enhancement provides increased
bandwidth.

[illegible] to have endpoints which [illegible] to allow
controlled sharing between endpoints in a flexible manner. Access
to address registers is limited to a contiguous block of
registers called a window. As with endpoints, address register
windows may be overlapped and nested with address register
windows to allow controlled sharing between address register
windows in a flexible manner. Address register protection may
also be provided to restrict access of the sender to different
registers.

Numerous other enhancements may be made, including [illegible]
exception handling and flow control, paging connection and
endpoint tables, adding global [illegible] and using [illegible]
look-aside buffer misses.

It is also possible to provide higher level operations executable
at the receiver as part of an instruction memory accessible by an
instruction pointer found in the operation field of a message.
Thus, the sender may be isolated from some knowledge of the
receiver. The full spectrum of operation of sender-based
addressing and receiver-based addressing is thus provided.

With this system, communication overhead is reduced by allowing
the sender to specify as much as possible about the intended
action for the message, while still allowing the receiver to
control message reception for protection and receiver-dependent
operations. Thus, both sender and receiver information is used to
demultiplex messages directly to where they are needed, reducing
latency. The processor at the receiver is involved only when
synchronization is required. That is, interrupts are eliminated
for every message; an interrupt is generated only when a message
requires immediate action. Thus, impact on the processor is
reduced. This combination of control of asynchronous






5,790,804 


events and direct deposit messaging provides flexibility and
reduced overhead with both full protection and separation of
control and data. This network interface and protocol is
applicable across a wide range of network and across a range of
applications in parallel, distributed, and real-time computing.

In summary, a network protocol and interface using direct deposit
messaging provides low overhead communication in a network of
multi-user computers. This system uses both sender-provided and
receiver-provided information to process received messages and to
deposit both data and control information directly where they are
needed: data in memory and control information in
conditionally/optionally interrupting a host processor. Message
processing is separated into data delivery, which bypasses the
host processor and operating system, and message actions which
may or may not require host processor interaction. In this
protocol, a message includes an indication of the operation
desired by the sender, an operand specified by the sender and an
operand which refers to some information stored at the receiver.
The receiver ensures that the desired action is permitted and
then, if the action is permitted, performs the action according
to both the operand specified by the sender and the state of the
receiver. The action may be message delivery, wherein the
operands in the message specify values for use in various
addressing modes including direct, indirect, post-increment and
index modes. The action may also be conditionally generating an
interrupt, wherein the operands are used, in combination with the
receiver state, to determine whether a message requires immediate
or delayed action. The action may also be an operation on a
register in the network interface or on other information stored
at the receiver. The network interface and protocol are intended
for use with local-area networks. Specializations of this
interface and protocol are particularly applicable to
asynchronous transfer mode (ATM) networks. The network interface
includes endpoints which may be nested and overlapped, address
registers which may be organized into windows which may be nested
and overlapped, address register protection and integration of
exception handling and flow control.

BRIEF DESCRIPTION OF THE DRAWING

In the drawing,

FIG. 1 is a block diagram of a typical conventional computer
system with a communication system using receiver-based
addressing;

FIG. 2 is a flow chart describing a conventional communication
process for the computer system of FIG. 1;

FIG. 3 is a block diagram of a conventional computer system with
a communication system using sender-based addressing;

FIG. 4 is a flow chart describing the operation of the computer
system of FIG. 3;

FIG. 5 is a block diagram of a computer system with a
communication system in accordance with the invention using
direct deposit messaging;

FIG. 6 is a flow chart describing the operation of the computer
system of FIG. 5;

FIG. 7 is a block diagram describing a suitable format for a 53
byte ATM cell to be used in one embodiment of this invention;

FIG. 8 is a block diagram describing one embodiment of the
network interface in the communications system of the computer
system of FIG. 5 for use in ATM networks with the cell format of
FIG. 7;

FIG. 9 is a block diagram of the receive side operation logic of
FIG. 8 showing protection and address generation portions;

FIG. 10 is a block diagram of an example realization of the
receive controller shown in FIG. [illegible];

FIG. 11 is a block diagram of a direct cache interface;

FIG. 12 [illegible] established; and

FIG. 13 is a block diagram of the receive side operation logic of
FIG. 8 in a second embodiment of this invention.

DETAILED DESCRIPTION

The present invention will be more completely understood through
the following description which should be read in conjunction
with the attached drawing in which similar reference numbers
indicate similar structures. All references, including
publications and patent applications, cited above and following
are hereby expressly incorporated by reference.

A typical computer system is shown in FIG. 1, and includes, for
purposes of illustration, a first computer (hereinafter called
the sender) 50 and a second computer (hereinafter called the
receiver) 52. It should be understood that the reference to
sender and receiver are used merely for ease of illustration. The
sender 50 and the receiver 52 are interconnected by a network 82
via respective network interfaces 84 and 86. Although FIG. 1
shows two computers (sender 50 and receiver 52), the system is
not limited to two computers. There may be multiple senders and
one receiver, multiple receivers and one sender or multiple
senders and receivers. Also, each such computer may also comprise
interconnected processors. Finally, the sender and receiver may
be interconnected processors within a single computer such as one
parallel computer. The term node as used herein signifies any
sender or receiver. Each node is assumed to have its own virtual
address space (i.e., the nodes are assumed to have virtual
memory) distinct and independent from other nodes. The network is
potentially used by multiple non-cooperating users and each node
may be used by more than one user. Thus, protection mechanisms
are typically provided to protect against accidental or malicious
interference with a process of one user by another user. For
example, there are mechanisms to prevent one user from 1)
unauthorized [illegible] of messages to another user, 2)
unauthorized access to memory of another user or 3) trying to
appear as another user to a receiver. These mechanisms are
commonly found in standard [illegible] networks [illegible]
often not found in closed, proprietary networks.

The sender 50 includes a processor 54 connected to the network
interface 84 and programmed according to a desired operating
system, illustrated at 56. The operating system 56 is a computer
program which manages node resources such as processor time and
network access, arbitrates and protects applications from each
other, and controls interaction between applications, such as
applications 58 and 60 in FIG. 1 and the processor 54. The
operating system 56 has associated therewith message buffers 62
and 64 which are used to send messages across the network
interface 84 to the receiver computer 52. Each application 58, 60
has associated therewith respective memory portions 66 and 68
which may be called endpoint buffers or simply endpoints.

Similarly, receiver computer 52 includes a processor 70 connected
to the network interface 86. The processor 70 is programmed
according to a desired operating system 72 as
illustrated in FIG. 1. Similar to the sender, the operating 
system 72 of receiver 52 includes message buffers 74 and 
76. Applications 78 and 80 also have associated therewith 
memory portions 79 and 81, which also may be called 
endpoint buffers or simply endpoints. 

In general, the sender or receiver may include other 
processors such as a network co-processor or other 
co-processors. To remove ambiguity, processors 54 and 70 
are referred to as "host processors" herein. 

A message (shown at 83) in a conventional system 
includes a header which is used by the network 82 and the 
network interfaces 84, 86 to direct the message to the 
appropriate node and endpoint. 

An application, such as 58, at the sender 50 and receiver 
52 communicate by sending messages to each other across 
the network 82. The message may include any data, including 
a procedure call. As shown in FIG. 2, communication 
conventionally involves at least the following steps. First, 
the application 58 invokes a send command in step 99. An 
example way for an application 58 to invoke a send operation 
is by executing a command as depicted by the "send" 
command 61. The command includes an identifier (ID) 
which is used in part to identify the receiver, a source 
address from which message data will be taken and a size, 
indicating the amount of data to be sent. The operating 
system then copies the message in step 100 from application 
memory, such as an endpoint 66, to message buffers, e.g., 62, 
in the operating system 56 of sender 50. This step is often 
optimized by mapping locations in the application memory 
to locations in the message buffer 62 to avoid actual copying. 
The operating system 56 then performs protocol processing, 
if necessary, in step 102. That is, the data to be sent 
is placed into the proper format as may be required by the 
network 82. The message, or perhaps several messages if the 
amount of data is large, is then injected in step 104 into the 
network 82 through network interface 84. 
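
The send-side steps of FIG. 2 can be sketched as follows. This is an illustrative model only: the function name, the dictionary packet layout, the use of a plain list to stand in for network 82, and the 48-byte default payload size are assumptions, not details given in the text.

```python
def conventional_send(app_memory, src, size, dest_id, network, payload_size=48):
    """Model of steps 100 to 104 of the conventional send path.

    app_memory : bytes-like application memory (e.g. an endpoint)
    src, size  : source address and amount of data named in the send command
    dest_id    : identifier (ID) used in part to identify the receiver
    network    : a list standing in for the network (assumption)
    """
    if src < 0 or size < 0 or src + size > len(app_memory):
        raise ValueError("send range lies outside application memory")

    # Step 100: copy from application memory into an operating system buffer.
    os_buffer = bytes(app_memory[src:src + size])

    # Step 102: protocol processing, modeled here as splitting the data into
    # fixed-size payloads, each with a small header.
    packets = [
        {"id": dest_id, "seq": seq, "payload": os_buffer[off:off + payload_size]}
        for seq, off in enumerate(range(0, size, payload_size))
    ]

    # Step 104: inject the message, or several messages if the data is large.
    network.extend(packets)
    return len(packets)
```

The explicit copy in step 100 is the part that the mapping optimization mentioned above avoids; the sketch keeps it to make the extra data movement visible.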

Usually, message arrival at the receiver causes an interrupt 
to the processor 70, and the operating system 72 directly 
extracts the message in step 106 from the network (meaning 
network and network interface) and copies it into message 
buffer 74 in the operating system. Alternatively, in a system 
with appropriate hardware, the operating system may set up 
a direct memory access (DMA) with the network interface 
to extract and copy the message to message buffer 74. 

In some communication schemes an intelligent network 
interface, or perhaps a second processor at the receiver 52 
(e.g., the communication co-processor in the Intel Paragon) 
extracts the message from the network 82 and copies the 
message to the message buffer 74. The operating system 72 
then performs protocol processing, if necessary, in step 108, 
then copies the message in step 110 from the message buffer 
74 at the receiver 52 to application memory such as endpoint 
79. Usually the operating system 72 copies this data to 
application memory in response to an explicit receive 
request by an application, such as a "receive" command 77. 
The command includes an ID which is used in part to 
specify the message buffer, e.g., 74, from which data is to be 
received, the receive address to which the message should be 
copied, and a size indicating the amount of data. It is also 
possible for the operating system 72 to automatically copy 
the data according to some previously saved state information. 
As in the sending side, this copying is often optimized 
by mapping locations in the application memory to locations 
in message buffer 74 to avoid actual copying. 
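
The receive-side steps can be sketched in the same spirit. The class and method names are hypothetical, and modeling an operating system message buffer as a byte queue keyed by buffer ID is an assumption; the point of the sketch is that every arrival costs an interrupt and that data reaches application memory only through a second copy.

```python
class ConventionalReceiver:
    """Toy model of the conventional receive path (steps 106 to 110)."""

    def __init__(self, app_memory_size):
        self.os_buffers = {}                      # buffer ID -> queued bytes
        self.app_memory = bytearray(app_memory_size)
        self.interrupts = 0

    def on_arrival(self, buffer_id, payload):
        # Step 106: each arriving message interrupts the host processor and
        # is appended at the next sequential position of the OS buffer.
        self.interrupts += 1
        self.os_buffers.setdefault(buffer_id, bytearray()).extend(payload)

    def receive(self, buffer_id, recv_addr, size):
        # Step 110: explicit receive request; copy up to `size` bytes from the
        # OS buffer to the receive address in application memory.
        if recv_addr < 0 or size < 0 or recv_addr + size > len(self.app_memory):
            raise ValueError("receive range lies outside application memory")
        buf = self.os_buffers.get(buffer_id, bytearray())
        data = bytes(buf[:size])
        del buf[:size]                            # consume what was delivered
        self.app_memory[recv_addr:recv_addr + len(data)] = data
        return len(data)
```

A short usage example: two arrivals on buffer 74 followed by one `receive(74, 4, 4)` call move four bytes into application memory at address 4 and leave the remainder queued in the operating system buffer.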

A communication system has overhead which includes 
both communication latency and impact on the processors 
54 and 70 of the sender 50 and receiver 52. For sake of 
simplicity, communication latency and communication 
overhead are referred to herein merely as latency and 
overhead. Latency may be defined as an amount of time 
taken for a message to be transferred from application 
memory at the sender 50, such as endpoint 66, to application 
memory at the receiver 52, such as endpoint 79. Impact on 
the processor involves interrupt handling, data flow control 
and protocol processing. Overhead is reduced by optimizing 
the steps described above in connection with FIG. 2 so as to 
reduce latency and impact on the host processor. 

Low overhead is important for applications which require 
rapid response behavior, such as parallel and distributed 
computing systems and real-time control systems. In parallel 
computing systems, low latency is essential to reduce the 
amount of time a process at a sender 50 waits for data to be 
read from a remote memory location, e.g., application 
memory 79, and for remote synchronization operations (e.g., 
obtaining and releasing locks) to be completed. In distributed 
computing systems, performance of a client-server 
model is often limited by the amount of time required to do 
a remote procedure call (RPC), which is affected by latency. 
The importance of low latency is perhaps most obvious in 
real-time systems where an inordinate delay in communicating 
a control input may lead to disaster. 

Even when low latency is not essential for a given 
application, it may increase the spectrum of possible 
applications and the flexibility in structuring a system. In parallel 
computing systems, lower latency enables the efficient 
exploitation of more finely grained computations and thus 
increased parallelism. In a distributed computing system, 
sufficiently low latency may make paging, e.g. by sender 50, 
over a network 82 to memory in a remote node, e.g. receiver 
52, faster than paging to a local disk (not shown). Finally, 
low latency could help make client-server based computing 
systems attractive for realizing flexible real-time computing 
systems. 

Low impact on the host processor is important to minimize 
the degradation on applications due to reduced and 
unpredictable processor availability. Predictability is 
particularly important for real-time tasks performed by the host 
processor. It is important to insulate such applications from 
unrelated asynchronous network events. 

Current generation parallel computing systems with 
proprietary networks obtain latencies in the range of 1 μsec to 
100 μsec. It is desirable to have latency in a local-area 
network be no more than 1000 cycles, which for future 1 
GHz processors, is 1 μsec. In conventional 10 Mbps Ethernet 
LANs, latency is typically about 1 msec. First generation 
100 Mbps ATM networks can achieve about 250 μsec 
latency using conventional network protocols and interfaces. 
Because increasing the speed of a network does not 
necessarily reduce latency, to achieve lower latencies 
improvements are needed in the operating system, the network 
interface and the network protocol. 
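
The latency figures above can be put on a common scale by converting each to processor cycles at the 1 GHz clock rate mentioned in the text. The helper name and the table layout below are illustrative assumptions; the latency values are the ones quoted above.

```python
def latency_cycles(latency_us, clock_mhz):
    """Cycles elapsed during a latency given in microseconds.

    One microsecond at a clock of N MHz is exactly N cycles, so the
    conversion is a plain product and stays in integer arithmetic.
    """
    return latency_us * clock_mhz


CLOCK_MHZ = 1000  # the "future 1 GHz processor" of the text

# Latencies quoted in the text, in microseconds.
quoted_latencies_us = {
    "desired LAN latency": 1,
    "first generation 100 Mbps ATM": 250,
    "conventional 10 Mbps Ethernet": 1000,
}

cycles = {name: latency_cycles(us, CLOCK_MHZ)
          for name, us in quoted_latencies_us.items()}
```

On this scale the desired 1 μsec corresponds to 1000 cycles, while the quoted ATM and Ethernet figures correspond to 250,000 and 1,000,000 cycles respectively, which is the gap the improvements named above are meant to close.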

The conventional approach will now be described in more 
detail in order to identify the obstacles to obtaining low 
overhead communication. Conventional communication 
systems in both distributed systems (e.g., TCP/IP 
implementations) and parallel computing machines (e.g., the 
Intel Paragon) are oriented towards bulk and stream data. In 
such systems, messages, often large in size, include a buffer 
identifier (ID) and data. A combination of interface hardware 
and operating system software demultiplexes arriving 
messages via the buffer ID into sequential positions in the 
identified message buffer, e.g., 74 in FIG. 1. 
In almost all cases, the network interface generates an interrupt
to the host processor. The operating system then either directly
copies data from the network interface to a message buffer 74
within the operating system or sets up a DMA transfer which
accomplishes this message copy. Because this message buffer 74 is
within the operating system 72 as described earlier, the
operating system 72 at the receiver 52 must copy the data from
the message buffer 74 to application memory, such as endpoint 79.
One way to eliminate the copying overhead of such buffering is to
map the application memory of endpoint 79 to the message buffer
74. Alternatively, the data could be copied directly to
application memory from the network interface.

Regardless of the implementation method, messages are really only
demultiplexed to sequential locations in either the message
buffers in the operating system or in the application memory (if
copied there directly from the network interface): the position
at which a message is stored within a message buffer is
determined implicitly, e.g., by sequencing a pointer. Also,
typically, every message received causes an interrupt to the
processor and operating system at the receiver to extract the
message from the network to the message buffer. Because the
sender has no direct control over either the final address of a
message, or the interrupt status of the receiver on receiving the
message, this form of message handling is appropriately called
receiver-based addressing.

It is common with such a conventional network protocol

sage is to be deposited in the receiver, i.e., by including the
"receive address". Thus, a message such as 94 sent from the
sender 87 to the receiver 88 includes not only header and data
information but also the address at the receiver 88 into which
the message 94 will be placed. The sender may also directly
specify whether an interrupt is to be generated upon message
arrival at the receiver.

FIG. 4 is a flow chart describing the general operation of the
system of FIG. 3. First, the sender determines an address where
the message is to be deposited at the receiver. The sender
invokes the send command in step 111 with this address. The
message is then injected into the network in step 112 directly
from the application memory, in contrast to the conventional
system shown in FIG. 2 which injects a message from an operating
system level message buffer. Next, the message is extracted in
step 113 and, from the network interface, the message is
demultiplexed directly into the receive address in the message
(step 114). Again, this is in contrast to the conventional
network system which extracts it to an operating system level
message buffer.

In this system, copying of message data from application memory
to and from operating system message buffers (steps 100 and 110
of FIG. 2) is omitted. Rather, a message is copied directly from
application memory in the sender to application memory in the
receiver. Interrupts to the processor may or may not be indicated
by the message. The interrupt state is solely dependent on the
specification by the sender. Various realizations of such a
system differ in the

to multiplex application and message processing on a single M presence, amount and details of protection, as indicated by 

processor. However, this multiplexing introduces significant 97, 98 in FIG. 3. 

overhead, because many cycles are required to handle an Such a network protocol and interface has been used 
interrupt First a trap into the kernel is made, then many primarily for parallel machines, such as the Tera 3D avail- 
cycles are used to demultiplex the interrupt. Finally, an able from Cray Research, Inc., of Minnesota, and the 
interrupt handler Is executed. This multiplexing overhead 35 Stanford University DASH machine. Such a parallel 
both incurs delay in message delivery and temporarily machine with a global address space uses a message, usually 
postpones application processing at the receiver. The unpre- small in size (e.g., word or cache block), which carries both 
dictible nature of application interrupts caused by asynchro- an address and data The data is stored directly in the 
nous message arrival is a problem for real-time systems. receiver address contained in the message. With this method. 
Frequent interrupts also degrade application performance. ^ demultiplexing at the receiver is trivial; the sender specifies 
Addressing in early multicomputers such as the Intel iPSC *U me information. Consequently, this form of message 
used such an approach: the operating system kernel at the handling is called sender-based addressing. The analogous 
receiver demultiplexed messages to message buffers and the term "sender-based interrupts** is used to describe the inter- 
application demultiplexed the buffer contents to application m & generation. 

memory. Kernel involvement remains state of the art in the 45 Because such a parallel machine is often assumed to have 

parallel workstation area, such as the SP1 developed by a single user, or multiple users interacting benevolently. 

International Business Machines. An alternative solution protection issues are often not addressed in their internal 

used with some conventional communication protocols is to network systems which are used exclusively for intercon- 

add another full processor as a communication co-processor, necting parallel processors. That is, protection 97, 98 in FIG. 

such as on the Intel Paragon, which is used to handle ^ 3are often omitted. 

asynchronous message arrivals. Such a communication This invention overcomes problems in the prior art by 

co-processor is also useful for handling complicated gather- directly depositing messages where they are required. FIG. 

scatter operations, which arise when large messages are 5 ^ a block diagram of a communication system in accar- 

used. Since this co-processor duplicates hardware, it is dance with the invention. It includes a sender computer 270 

expensive. 55 and receiver computer 272 interconnected by a network 82 

Instead of devoting significant resources and complexity via network interfaces 274 and 276. The sender and receiver 

at the receiver, like the co-processor approach, to determine computers 270, 272 each have respective processors 54 and 

where to deposit and how to handle messages, the sender can 70 programmed according to a desired operating system 278 

do this determination. Such a known, though less conven- and 280. For the purposes of illustration, the terms endpoint 

rional approach for a communication system is shown in the 60 and connection will be used. These terms should not be 

block diagram in FIG. 3. In this system, a sender 87 and construed to limit the invention, as the invention is appii- 

receiver 88 are connected by network 82 via network cable to many types of computer networks. As used herein, 

interfaces 89 and 90. The respective operating systems 91 an endpoint signifies a contiguous region of virtual memory, 

and 92 need not have message buffers. In this system, the A connection is a virtual channel authorizing communica- 

sender specifies where a message will be deposited in the 65 tion between a pair of endpoints. A connection may also be 

receiver. As indicated by the example command 95 in FIG. multi-cast i.e., not restricted to pairs of endpoints. Appli- 

3. the "send" command now also indicates where the mes- cations 58 and 60 being executed on the sender 270 and 




applications 78 and 80 being executed on receiver 272 each have one or more endpoints assigned to them, e.g., respectively 66, 68, 79 and 81. Each of the operating systems 278 and 280 store connection state and mapping information 282 and 284 respectively, which are indicative of the states of the sender 270 and receiver 272. This information preferably is cached in the network interfaces 274 and 276 as indicated at 286 and 288.

A message 290 sent from the sender 270 to the receiver 272 includes both control information and data. The control information may include an indication of an action to be performed, one or more operands indicative of a state of the sender and one or more references to information stored at the receiver. This information stored at the receiver may also be called state maintained by the receiver or receiver state.

FIG. 6 is a flow chart describing generally the operation of the system shown in FIG. 5. The sender first generates separate control information and data and constructs a message in step 115. As indicated by the example command 101 in FIG. 5, the "send" command includes an identifier of a connection, an address from which data is to be taken, control information and an indication of the amount of data to be sent.

This message is injected directly from application memory into the network in step 112. Next, the message is extracted from the network into the network interface in step 113. In step 116, the network interface demultiplexes the message, depositing data directly into memory and/or conditionally delivering interrupts to the processor.

A message originating at a source endpoint bypasses the operating system and host processor at the sender. At the receiver, the operand may be compared to connection state information to determine whether an interrupt should conditionally be delivered to the processor 70 and the receiver 272. Also, these operands may be combined with connection, state and mapping information to determine the address in memory in 79, for example, to deposit the message data.

Thus, in this invention, copying of message data from operating system message buffers to application memory is omitted. Furthermore, interrupts are conditionally generated according to both sender state and receiver state. Also, an address into which message data is deposited is determined in part by sender state and in part by receiver state. Thus, the sender may be isolated from too much knowledge of the receiver.

This system is generally based on the observation that message sending is simple, whereas message reception is complex because of the asynchronous nature of message receiving. Message handling at a receiver should therefore be separated into message delivery and message action. This separation allows control of asynchronous events, wherein events are distinguished by their need for the host processor at the receiver. Events, such as message delivery, which do not require the host processor are handled directly. Other events, chiefly synchronization, that require interaction with the host processor are further divided into actions that require immediate service and actions that can be delayed and accumulated and processed at some time convenient for the host processor, thereby turning them into synchronous events. With this separation of data and control events, resources sufficient to bypass the non-host processor events, rather than all events, is all that is required.

Message delivery simply involves depositing a message in a desired location in the memory of the receiver, e.g., by a remote write or a direct memory access (DMA). Message action is taking some action in response to reception of a message, such as returning a value, performing a read operation, notifying a task that the data has arrived, enabling a task on the scheduler queue, which is a data structure indicating tasks eligible to be executed by the processor, or invoking an arbitrary interrupt handler, e.g., a remote procedure call (RPC).

Whether an action is immediate or delayed depends on 1) when a remote process awaiting the result of the action (if any), hereafter called the waiting task (not always the [illegible] to [illegible] of other [illegible]. Some examples of immediate actions are a read or synchronization operation where the waiting task needs the result to [illegible], and a high priority control operation, such as some operating system actions. Some examples of common delayed actions are notifying a task that data has arrived and enabling a task on a scheduler queue. The related message [illegible] referred to as not requiring immediate action. [illegible] a system can be structured so that an immediate response is not necessary. For example, a remote node can execute another task while it awaits a response from a [illegible] action. Of course, if a message is destined for a waiting task which is not currently active anyway, e.g., notifying an inactive process that data has arrived, any action in response to that message may be delayed. When an action may be delayed, the message may be queued for processing at a later, more convenient time for the receiver. Thus, a message for which an action may be delayed becomes a synchronous event. Conveniently, queuing a message is merely message sending and a queue pointer update.

To implement this kind of protocol, herein called direct deposit messaging, a message contains the connection, which is an identifier which implicitly identifies the receiver endpoint, control information indicating both an action to be performed and one or more operands, and data. Each operand may be an address to be used by the receiver, and/or may be a parameter used to determine whether the specified action must be performed immediately or may be delayed, and/or may name some receiver state. Addresses are encoded as an offset from the base of the endpoint. The offset is essentially a network logical address that insulates the sender and receiver from the addressing details, e.g., address space size, virtual to physical mappings, and page size, at the other. This separation promotes modularity and accommodates node heterogeneity. Furthermore, an offset typically does not need the full dynamic range of a virtual or physical address and thus can be encoded in fewer bits within a message.

A set of primitive actions, representing common operations that may be implemented simply without host processor or operating system intervention, is provided. More complex actions are left to the host processor. The primitive actions described herein are simple data transfer, i.e., read and write to endpoint locations and conditional interrupts to the host processor for delayable or immediate actions.

The simplest operations are pure sender-based direct read and write data transfers. For a direct write, the sender specifies the source data by its offset from the base of the source endpoint and the receiver location by the offset from the base of the receiver endpoint. Messages contain the receiver offset and the data. For reads, the source sends a message with a direct write request to the receiver, along with the offset in the receiver and the deposit offset (for the reply) in the source, and an indication of a reply connection if the connection is not duplex.
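
The direct write and read transfers just described can be pictured with a short sketch. It is illustrative only and is not the implementation disclosed here: the names `Endpoint`, `Node`, `direct_write` and `direct_read` are assumptions, a `bytearray` stands in for an endpoint's memory, and the network is replaced by ordinary method calls. The read is served by a reply write into the requester's endpoint at the deposit offset, as in the text.

```python
from dataclasses import dataclass, field


@dataclass
class Endpoint:
    """A contiguous memory region; a bytearray stands in for virtual memory."""
    size: int
    memory: bytearray = field(init=False)

    def __post_init__(self):
        self.memory = bytearray(self.size)


class Node:
    """Hypothetical node holding a connection-to-endpoint mapping."""

    def __init__(self):
        self.connections = {}  # connection identifier -> Endpoint

    def _check(self, conn, offset, length):
        # Refuse messages on unknown connections or with offsets that
        # fall outside the endpoint.
        if conn not in self.connections:
            raise PermissionError("unknown connection")
        ep = self.connections[conn]
        if offset < 0 or length < 0 or offset + length > ep.size:
            raise PermissionError("offset outside endpoint")
        return ep

    def direct_write(self, conn, offset, data):
        """Deposit data at (endpoint base + offset)."""
        ep = self._check(conn, offset, len(data))
        ep.memory[offset:offset + len(data)] = data

    def direct_read(self, conn, offset, length,
                    requester, reply_conn, deposit_offset):
        """Serve a read by issuing a reply write back to the requester."""
        ep = self._check(conn, offset, length)
        data = bytes(ep.memory[offset:offset + length])
        requester.direct_write(reply_conn, deposit_offset, data)


source, target = Node(), Node()
source.connections[1] = Endpoint(size=64)   # where replies are deposited
target.connections[7] = Endpoint(size=64)

target.direct_write(7, offset=32, data=b"hello")
target.direct_read(7, offset=32, length=5,
                   requester=source, reply_conn=1, deposit_offset=0)
reply = bytes(source.connections[1].memory[:5])
```

Because only offsets travel in the messages, neither side in this sketch needs to know the other's base address; the bounds check plays the role of the receiver verifying that the requested action is permitted.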
To enable actions which are a function of both sender and receiver state, the receiver end of each connection maintains some state, i.e., stores some information which message operands may name and so obtain receiver addresses. To simplify matters, this state is contained in specially addressable locations which herein are called "address registers". This state could also be held in general memory locations. Thus, message actions are a function of an operation specified by the sender, operands representing sender state, and receiver state, such as the contents of the address registers. The following is an example set of primitive operations using sender and receiver state.

Address Generation
1. Direct addressing: effaddr = operand
2. Indirect addressing: effaddr = <addreg_i>
3. Indexed addressing: effaddr = <<addreg_i> + operand>

Register operations
1. addreg_i ← operand
2. addreg_i ← unary-op <addreg_i>
3. addreg_i ← <addreg_i> binary-op <addreg_j> where i and j are not necessarily different

Conditional operations
1. if (<addreg_i> compare-op operand) then generate interrupt at end
2. if (<addreg_i> compare-op <addreg_j>) then generate interrupt at end where i and j are not necessarily different

Some form of address generation unit calculates an effective address (effaddr) at which to read or write data. <X> denotes the contents of memory location X. "Operand" may be data or other operand in a message. The message operation controls whether a read or write occurs to memory, the primitives selected, and their order. The conditional test may occur at any time but the interrupt preferably occurs at the end of the compound operation.

These primitive operations allow a rich set of powerful and flexible compound operations. For example, an indirect write with postincrement can be synthesized with an indirection followed by a register operation:

<addreg_i> ← MSG
addreg_i ← <addreg_i> + operand

The last step may also be: addreg_i ← <addreg_i> + <addreg_j>

Done on a per-cell basis, this compound operation is equivalent to DMA with stride equal to the increment value. However, note that varying operand or <addreg_j> yields variable strides.

As another example, priority queueing and interrupts can be synthesized as follows:

<addreg_p> ← MSG
addreg_p ← <addreg_p> + <addreg_s>
if (operand greater than <addreg_i>) generate interrupt at end
addreg_i ← <addreg_i> bitwise-or operand

where "operand" indicates the priority of the message, addreg_p points to the end of the queue to which a message of this priority should be added, addreg_s contains the size of MSG, and addreg_i holds the priority level at the receiver where p, i and s are different from each other. The message specifies the operand and register indices p, i, and s.

Complex compound operations like this priority queueing may require multiple compound operations. For example, two compound operation messages would be required in this case if the receiver executes one register operation per message.

As a final example, sometimes it may be convenient to append messages to one of several different queues without generating interrupts and maintain a bit vector of non-empty queues. This mechanism can be implemented in generally the same way as priority-based interrupts, but with the most significant bit of addreg_i set to block interrupts. This implementation assumes both unsigned comparison and fewer queues than bits in addreg_i. This second assumption could be relaxed by using multiple address registers. Messages may also be appended to a queue within a specialized endpoint within the operating system, enabling delayed actions involving the operating system.

As other examples, various atomic operations such as fetch-and-increment, read-modify-write, and compare-and-swap can be implemented by devoting one or more of the address registers for the target location and using the register operations for incrementing and comparing. Barrier synchronization can also be implemented this way. When a process reaches a barrier point it toggles a bit in a specified address register of all processes in the "barrier set" and then waits for a conditional interrupt when all the bits are set (or cleared).

This protocol provides at least the following benefits. First, the combination of sender state, receiver state, and operations on the two is very powerful. The full superset of capabilities of both conventional, receiver-based addressing, and sender-based addressing is possible. The mix of sender and receiver information can be varied on a per message basis to accommodate different requirements on what the sender knows, or alternatively, different requirements on the isolation of knowledge between sender and receiver. Indirection provides protection by isolating the sender from too much knowledge of the receiver because the actual storage address is partially a function of an address the sender may not know. Similarly, the interrupt status is partially a function of receiver information contributed (via messages) by processes that the sender may not know exist. As will be described below, a mechanism may also be provided for register protection that prevents a sender from accessing by read or write or otherwise determining or modifying the contents of specified receiver state, such as address registers.

Second, predictability of computation, which is important in real-time systems, is increased by restricting control flow interrupts to well-defined points. In effect, the asynchronism and nondeterminism is eliminated from asynchronous network events.

Third, action handling is more efficient and results in less overhead because interrupt overhead is amortized over multiple actions. Polling can be used to synthesize hybrid interrupt-polling action methods.

Finally, more complex operations can be formed by combining the result of multiple compound message operations.

This protocol is preferably endpoint and connection based. Endpoints and connections are allocated and deallocated with kernel calls. Preferably, endpoints are page-aligned. Thus, host virtual memory page protection is also used within endpoints. A connection may be established between any pair of endpoints, including endpoints on the same node. The connection establishment protocol is much like session establishment in Berkeley UNIX sockets. Some out of band mechanism, such as a boot-time agreed upon kernel endpoint and connection, is used to arrange allocation of the endpoint and connection in the receiver.

An endpoint may have multiple originating connections and/or multiple terminating connections. Connections can be simplex, duplex, or multicast. Connections originating from or terminating on an endpoint all share the same mapping information. However, endpoints can be overlapped or nested to form more complicated protection patterns. For example, connection A could create an endpoint with virtual address bounds (v_L, v_H). Then connection B could create a second endpoint that is a proper subset of this range to allow
connection A access to all of B, but B to only access a portion of A. Or, connection B could create an endpoint that partially overlaps with A's (v_L, v_H) range to allow connection A and B a limited range in which to share without exposing their entire respective endpoints to the other. Different protection schemes can also be realized by mapping the physical pages of an endpoint to virtual address ranges with different page protection.

Network protection is provided as follows. Access to the network via out-going connections is controlled by per-connection state maintained by the kernel. Messages arriving from the network check for authorization with the receiver connection state maintained by the receiver kernel. Authorization to receive a message from an incoming connection implicitly authorizes the message to write in the associated endpoint. However, the receiver address must still map to a legitimate endpoint address and the operation must be permitted by per-page access rights.

Protection can also be provided in such a system by having the receiver verify that the operation requested and the address used at the receiver are legitimate. For example, a memory region can be specified for each application. If the address to which data is written is not in the specified region, access is denied.

An implementation of this system, specialized to ATM networks, will now be described. It should be understood that the following is just an example, and that the invention may be implemented for other networks and in different ways.

FIG. 7 shows one possible format of a 53 byte ATM cell 120 for this implementation. In the simplest format, a cell includes data and control information along with the standard network header and other information. The sizes of the data fields in this format are merely exemplary and are not intended to be limiting.

In FIG. 7, an ATM header field 132 contains link routing, which indirectly identifies the receiver, and traffic control information. This field is five bytes and is in a format suitable for processing by a standard ATM switch. The connection number is encoded in a virtual channel/virtual path identifier (VCI/VPI) field (not shown) in this header. There are also a field of 32 bytes of data 122 and 16 bytes of control data (discussed below) per cell 120. The data field size matches memory and cache block sizes of 32 bytes and thus enables fast, efficient hardware implementations. Data masks can be used to eliminate unwanted message data at the receiver, as explained shortly.

A cyclic redundancy check (CRC) field 124 of two bytes may also be provided, to correspond to the data field 122, at the end of the cell 120, to help prevent an errant message from being interpreted as a valid message. Two other bytes are unused.

The control information includes a four-byte operation field 130 which specifies the type of operation to be performed at the receiver. This operation field 130 may include a mask field 131 and opcode field 129. The opcode field specifies the operation, whereas the mask field can be used to deselect the reading or writing of four byte words within a block of the data field 122. That is, bit i in the mask controls whether data word i is read or written. This feature is useful to update a location without changing the values (e.g., variables) in neighboring locations in a block. A four-byte operand field 126 is also provided. The operand is a 32 bit immediate source operand (offset or data). Destination operands are specified via three separate register indices encoded in a four-byte index field 128. These separate index fields are shown at 121, 123 and 125. The check field 127 contains a simple check sum over the prior control fields so that decoding of the control can begin without waiting for the entire cell to arrive.

For a read request to a remote node, the data field 122 contains the control field for the reply write message. There is also a multiple cell message format for block transfer. In this format, the first cell is a "control" cell in the write format shown in FIG. 7, and the following cells are standard AAL5 (ATM Adaptation Layer 5, an ATM signaling standard) cells. To avoid complexity with cell boundaries and length and CRC in the last AAL5 cell, all block transfers are multiples of 16 bytes.

As mentioned above, the direct deposit model of communication herein described is endpoint and connection-based: an application allocates endpoints, sets up connections between the endpoints, and then sends messages over these connections. To support such operation with the direct deposit model of communication herein described, the operating system at each node maintains the following data structures. However, a hardware implementation may cache some or all of these data structures to support high speed operation, as will be described in more detail below in connection with FIG. 9.

An endpoint table includes an entry for each endpoint at that node (e.g., sender 50), indexed by an endpoint number. Each entry in this table contains indications of a base memory address for the endpoint, e.g., in memory 79 or 81 at the receiver 52, endpoint size, virtual to physical mapping information, access information, e.g., private, read only or shared, all open connections to the endpoint and any processes attached to the endpoint.

A connection table includes an entry for each connection originating or terminating at that node, indexed by the connection number. Each entry in this table contains an indication of an endpoint number, address register base and bounds, connection state information, and reply connection information.

A node address table is also used. Each entry for a node includes the name of a remote node and the connection number for a connection (direct or indirect) to the operating system of the remote node. The index to the table is a unique global identifier for each node. This table is used to contact remote nodes for connection set up. This table may also contain naming information via some alternative signalling mechanism, such as Internet-protocol (IP) addresses.

Data are delivered to endpoints and interrupts are delivered to the operating system. Specifically, data are delivered to the endpoint at the receiver of a specified connection and not to processes which may be attached to the endpoints. Interrupts are delivered to the operating system running on the receiver and not to specified processes. The operating system may then deliver interrupts to specified processes.

Operation setup for such a system will be described in connection with the flowchart of FIG. 12. The sender 50 first allocates an endpoint in its virtual address space in step 200. A free slot is found or made in the endpoint table and is filled with the base address of the endpoint, and virtual to physical mapping information.

Then, via an alternate connection, perhaps a dedicated operating system connection or an alternate network like a transmission control protocol/internet protocol (TCP/IP) connection, in step 201 the sender contacts the intended receiver and requests a connection be set up with an appropriate endpoint buffer size. The receiver then allocates that size buffer region in its virtual address space, finds or makes a free slot in the endpoint table, and fills it with the buffer base address, and virtual to physical mapping information.
The receiver then acknowledges the connection in step 203. In a multicast connection, this procedure is repeated for each sender-receiver pair in the multicast. Messages containing an offset from the base of the endpoint may then be sent over the connection in step 204.
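
The endpoint table, the connection table and the setup sequence of steps 200 to 204 can be pictured with a small model. The sketch below is illustrative only and makes several assumptions: `Host`, `EndpointEntry`, `ConnectionEntry`, `allocate_endpoint`, `open_connection` and `setup_connection` are hypothetical names, table indices are simple list positions, addresses are made-up integers, only a few of the table fields named above are modelled, and the out-of-band request of step 201 is an ordinary function call rather than a network exchange.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EndpointEntry:
    base: int                                   # base address of the endpoint
    size: int                                   # endpoint size in bytes
    connections: list = field(default_factory=list)  # open connection numbers


@dataclass
class ConnectionEntry:
    endpoint_number: int                        # index into the endpoint table
    reply_connection: Optional[int] = None      # left unset in this sketch


class Host:
    """Hypothetical node with an endpoint table and a connection table."""

    def __init__(self):
        self.endpoint_table = []
        self.connection_table = []
        self._next_base = 0x10000               # made-up address allocator

    def allocate_endpoint(self, size):
        # Fill a new slot in the endpoint table with base address and size.
        self.endpoint_table.append(EndpointEntry(self._next_base, size))
        self._next_base += size
        return len(self.endpoint_table) - 1

    def open_connection(self, endpoint_number):
        self.connection_table.append(ConnectionEntry(endpoint_number))
        number = len(self.connection_table) - 1
        self.endpoint_table[endpoint_number].connections.append(number)
        return number


def setup_connection(sender, receiver, size):
    src_ep = sender.allocate_endpoint(size)      # step 200
    # Step 201 would carry the request over an alternate connection;
    # here it is an ordinary call.
    dst_ep = receiver.allocate_endpoint(size)    # receiver allocates its buffer
    dst_conn = receiver.open_connection(dst_ep)
    src_conn = sender.open_connection(src_ep)    # after the acknowledgement (203)
    return src_conn, dst_conn


sender, receiver = Host(), Host()
src_conn, dst_conn = setup_connection(sender, receiver, size=4096)
```

After setup, a message needs to carry only the connection number and an offset; the receiver resolves the deposit address from `connection_table` and `endpoint_table`.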

This protocol can be implemented without special-purpose hardware using a computer program on a commercially available computer and network interface. Such an implementation has been made using two DECStation 5000/240 workstations, available from Digital Equipment Corporation. Each workstation had a Fore Systems TCA-100 ATM network interface plugged into its TURBOchannel I/O bus and the two workstations were connected back to back, i.e., without an ATM switch. The DECStation 5000/240s had a 40 MHz MIPS R3000 processor, 64 Kbyte direct mapped off-chip instruction and data caches, and 32 Mbytes of main memory. The main memory and I/O subsystem, including the 32-bit wide TURBOchannel, operated at 25 MHz. The Fore TCA-100 was a very simple interface containing two FIFO queues, one for transmit and one for receive, and some control and status registers. An ATM cell was transmitted by the processor writing fourteen 32-bit words representing the 5 bytes of ATM header, 48 bytes of ATM payload, and 3 bytes of padding over the TURBOchannel to the TCA-100 interface. An ATM cell was received by the processor reading fourteen 32-bit words. Although the TURBOchannel supported DMA, the TCA-100 did not use it. The TCA-100 either generated a TURBOchannel interrupt when a cell arrived or a receive cell counter on the TCA-100 was polled to determine if any cells have arrived. The data rate over the fiber connecting the TCA-100s was 140 Mbps. The DECStations ran the Carnegie-Mellon University (CMU) microkernel-based operating system Mach 3.0 (MK83, UX41), for which full source code is readily available from CMU of Pittsburgh, Pennsylvania.

A simple remote write function was implemented on this system for experimentation purposes. In this implementation, a 32 byte block of data at a given offset from an endpoint in the sender was delivered to a sender supplied offset in a sender named endpoint in the receiver. The data, offset and buffer information were packed into a single cell. The data block size was 32 bytes, for agreement with cache block sizes, restricting the offsets to be 32 byte, block aligned for implementation ease.

The low level Mach kernel exception handling code was also modified to send a cell via an illegal instruction trap. On the receiving side, the Mach microkernel was modified to partly optimize the interrupt path. In particular, the TCA-100 handler was called directly from the kernel interrupt trap handler.

Using the implementation described, the following experiment was performed. From user level a previously stored block of data was sent from the source workstation to the receiver. A user [illegible]

The first row is the average time to send a cell via the illegal instruction trap send. The second row is the round trip send-to-receive time corrected for loop and measurement overheads. The first iteration column lists the time on the first iteration, and the average best column lists the average of the times on 19 following iterations. After the first two iterations typically there was very little variation in the times.

The first iteration incurs cache and translation-lookaside-buffer (TLB) misses which causes the send-to-receive time of this iteration to be much greater than that of subsequent iterations. (However, it is not clear why the send to receive time is so large on the first iteration. The first iteration send overhead is much more reasonable). After all cache and TLB misses and other transients have dissipated, the round trip time was 49.1 µsec, which means that the best case one-way send-to-receive time for a remote write was about 24.5 µsec, which is about 80 times faster than a similar test using Fore Systems' [illegible] implementation on the same hardware running Ultrix 4.3.

The bandwidth was measured by sending a sequence of consecutive remote writes as found below in Table 2.

TABLE 2

The block size is the number of consecutive remote writes. The bandwidth increases as the block size increases since the interrupt overhead is amortized over more data. The [illegible] interrupt handler [illegible] the receive FIFO [illegible] are sent sufficiently close together; only one interrupt, and hence one pass through the interrupt handler, may be required to read all the cells. Since sending a cell was fast with the illegal instruction trap method, this amortization effect is easily obtained, even when consecutive blocks are sent using an ordinary "for" loop. The asymptotic bandwidth achieved is thus a measure of the per cell overhead in processing a cell from the receive FIFO.

The 24 Mbps bandwidth is based on 32 data bytes per cell. [illegible] bandwidth of about 35 [illegible] 32 [illegible]. Thus the partly optimized single cell remote write implementation

testing me endpomt area for w d^bed above is obtaining about 68% of the practical peak 

data arrived, the receiver process sent it back to the source bandwidth mat can be obtained with 32 data bytes per cell, 

workstation where another user level process was running. Even so me 24 Mbps bandwidth is significantly better than 

testing for the arrival of the data. The total round trip time the 14 Mbps peak bandwidth measured for Fore System's 

was then measured. This includes two onc way remote AAL3/4 implementation with the same hardware running 

writes, loop overhead, since the round trip was repeated $5 Ultrix 4 J. 

twenty times, and measurement overhead. The results are The 24 usee latency reported above breaks into the 

listed in Table 1, below. components shown in Table 3. 
TABLE 3

Latency breakdown

TURBOchannel time        7 μsec
ATM cell time            3 μsec
CPU and memory time     14 μsec

Since the TCA-100 ATM interface does not have DMA,
all accesses to the interface to send and receive cells use
programmed I/O. Programmed I/O is slow on the
TURBOchannel, resulting in nearly one third of the latency.
The ATM cell time refers to the time for 53 bytes (one cell)
to completely collect at the receiver at the 140 Mbps data
transfer rate used by the fiber connection. The remainder of
the 24 μsec (14 μsec) is, of course, the CPU and memory
time at the sender and receiver, and is 58% of the total
latency. Consequently, optimizing the interrupt handler code
by 20% will only reduce the total latency by about three
μsec. Thus, although there may be room for optimization in
the current implementation, only diminishing returns would
be obtained. Thus, it is probably fair to conclude that
approximately 20 μsec is the minimum latency for remote
write with this common commercially-available hardware.
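
The arithmetic behind this latency argument can be checked with a short sketch. The component values come from Table 3; the variable names and the idea of applying the 20% speedup to the whole CPU and memory component are illustrative assumptions, not part of the described system.

```python
# Latency components from Table 3, in microseconds.
components_usec = {
    "TURBOchannel time": 7,
    "ATM cell time": 3,
    "CPU and memory time": 14,
}

total_usec = sum(components_usec.values())

# Fraction of the one-way latency attributable to each component.
shares = {name: t / total_usec for name, t in components_usec.items()}

# Assumed scenario: a 20% speedup applied to the CPU and memory component.
saving_usec = 0.20 * components_usec["CPU and memory time"]
improved_total_usec = total_usec - saving_usec

for name, share in shares.items():
    print(f"{name}: {share:.0%} of {total_usec} usec")
print(f"saving from a 20% speedup: {saving_usec:.1f} usec")
```

The saving is small relative to the total because the bus and cell times are untouched by software optimization.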

In view of these experimental results, special-purpose 
hardware should be added for protection and for direct 
depositing of data in memory to reduce the load on the 
processor, and to reduce memory and I/O bus time. This 
hardware support should reduce the end to end latency to 
close to single cell time: 2.7 μsec at a data rate of 155 Mbps,
and 0.68 μsec at a data rate of 622 Mbps. The reduced load
on the processor could also be important for a server, such
as a file server, in a distributed system that has a heavy
communication load. Variation in latency is also
reduced, and even the worst case latency can be good.
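
The single cell times quoted above follow directly from the 53 byte ATM cell size and the link rate. The following sketch reproduces them; the function name is hypothetical.

```python
ATM_CELL_BYTES = 53  # 5 header bytes + 48 payload bytes


def cell_time_usec(rate_mbps: float) -> float:
    """Time for one ATM cell to pass at the given link rate.

    Bits divided by megabits per second gives microseconds.
    """
    return ATM_CELL_BYTES * 8 / rate_mbps


for rate in (140, 155, 622):
    print(f"{rate} Mbps: {cell_time_usec(rate):.2f} usec per cell")
```

At 140 Mbps the same calculation gives roughly the 3 μsec ATM cell time listed in Table 3.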

Such an implementation in hardware of an architecture of
a network interface for supporting communication in accor- 
dance with this invention will now be described. Because a 
node typically may act both as a receiver and sender, a 
network interface for a node should handle both sets of 
functions. The functionality of the receiver 52 will now be 
described and includes address mapping and protection, 
address registers, control for data and address paths, and 
flow control. The sender will be described below. 

The implementation has two parts. The first part is a front 
end architecture for address mapping, address registers, and 
control mechanism functionality. This part has multiple 
possible embodiments, each having generally the same 
functionality. A general block diagram is provided in FIG. 8,
of which multiple embodiments are described below. The 
second part, the back end which connects the network 
interface to the host memory, also has multiple possible 
embodiments. Three of these are also described below: a 
direct connection to the main memory, a traditional I/O bus 
connection, or a direct connection to the secondary cache.

The front end architecture will now be described in 
connection with FIG. 8. The front end of the network 
interface is shown generally at 210. The front end connects 
to the host memory 212 via a bus 214. The connection of 214 
to host memory 212 is called the back end which will be 
discussed in more detail below. The front end 210, on the
receive side, includes a receive buffer memory with flow 
control 216. This receive buffer is preferably a first-in,
first-out (FIFO) memory element. A header splitting and
checking unit 218 processes incoming cells and demulti-
plexes the information to VCI/VPI mapping unit 220, con-
trol decoder 222 and data splitting and checking unit 224.
The data splitting and checking unit passes blocks of data to
block transfer unit 226, which can transfer data to the host
memory across the back end 214. The VCI/VPI mapping
unit 220 determines a connection number and applies it to
operation logic 230. The operation logic will be described in
more detail below in connection with FIG. 9.

Control decoder 222 decodes the control portion of the
incoming cell and determines an index and operand which
are provided to operation logic 230. The index and operand
are the same values indicated in the ATM cell of FIG. 7. The
control decoder 222 also outputs the opcode found in the
ATM cell to a receive controller 228 which provides control
information to the operation logic 230. The operation logic
outputs an address, state information, a condition code, a
reply connection number and interrupts. The address and
interrupts go to the back end 214 whereas a reply connection
number goes to the VCI/VPI mapping unit 232 on the send
side of the front end 210. Also on the send side of the front
end 210 is send controller 234 which receives information
from the host processor over bus 214 to provide control
information to the operation logic 230 and a connection
number to the VCI/VPI mapping unit 232. The send con-
troller also includes send registers 236 from which opcode,
operand and index information are provided to a control
encoder 238 which forms this information into the control
portion of the ATM cell described above in connection with
FIG. 7.

A block transfer unit 240 on the send side processes data
from the host memory into a data buffer 242. A cell forming
unit 244 takes the connection information, control
information, and data and forms a cell with appropriate
header information which is then applied to a flow control
and output buffer unit 246. The output buffer is preferably
a FIFO memory element. The message data may alterna-
tively originate from data registers contained within the send
registers 236.

Aside from the receive and send controllers 228 and 234
and the operation logic 230, the remaining functional blocks
are standard for network interfaces or are relatively simple
in function. Thus, detailed description of these is omitted.
The operation logic 230 and receive and send controllers
228 and 234, with a number of embodiments, will now be
described.

The operation logic 230 of FIG. 8 will now be described
in more detail in connection with FIG. 9. For ease of
description and illustration, latches and control signals
between elements in the figure have been omitted. The
signals on the left-hand side of FIG. 9 come from the data
fields in the ATM cell as described above in connection with
FIG. 7. For example, a connection number (conn#) is
derived from the VCI/VPI in the ATM cell header.

The operation logic 230 includes a connection table 140
which caches all or part of the connection table, described
previously, which is stored by the operating system.
Accordingly, for each connection, this connection table 140
contains an entry 142, which includes an endpoint number
146, address register information, including a base 148 and
a bounds limit 149, and connection state information 150
and reply connection 151. The use of these fields will be
described in more detail below.

An endpoint table 160 is also provided which caches all
or part of the endpoint table, described previously, which is
maintained by the operating system. Accordingly, for each
endpoint this endpoint table 160 contains an entry 162,
indexed by endpoint number, which includes an indication
of the base 164 of the endpoint, its bounds 166 and address
mapping information 168. The address mapping information
refers to a page map structure for that endpoint which is stored in host memory. This endpoint information is in a separate table so that multiple connections to the same endpoint can share the same information.

Address registers 170 are also provided. There are many options for implementing the address registers. Address registers are preferably private to each connection to limit problems with managing a possibly limited resource between competing activities. Any address within an endpoint could serve as a register for address indirection. Unfortunately, such generality causes access speed and protection problems. The speed problem is that a message with indirect addressing requires a first host memory access in the critical path to determine the location to store a message and a second host memory access with post-increment mode to store the updated index. The protection problem is that the indirection location can contain any address. The protection problem can be solved by translating this address, but then two address translations are required in the critical path: one to translate the address of the indirection location and another to translate the address in the indirection location. For these reasons indirection is preferably restricted to a number of dedicated address registers per connection located in interface memory. This arrangement introduces the awkwardness of a separate name space, but avoids host memory accesses in the critical path for indirection. Each connection has a number of contiguous locations to form a register "window". For convenience and flexibility, each connection is allowed to dynamically allocate the window size at connection set up time. The base and bounds fields 148 and 149 in the connection table entry 142 point to the beginning and end of this window respectively. This scheme allows the overlapping and nesting of register windows to effect different sharing and protection.

One problem with using address registers is that the receiver may wish to restrict the access a sender has to certain address registers. For example, the receiver might not wish to allow the sender to have the direct ability to increment a pointer to a queue. For these purposes, the address registers also include protection bits 144, which indicate what type of access a sender may have to the data in these registers. Three types of access protection are provided for: read, write, and indirect. Indirect access allows a sender to use this information in an operation without allowing the sender to determine the actual value. For example, an indirect address operation with postincrement can read the register and increment it by a specified amount, even if the sender does not have permission to directly read or write the register. An exception occurs if the sender specifies an operation involving an address register access in a way not permitted by the access protection. The host processor can thereupon choose to access the register directly.

Although address register protection allows a receiver to restrict sender access, the sender still names all of the operands in an operation. This lack of isolation can lead to other types of protection problems. For example, a sender could still give inconsistent operands, e.g., an operand priority and priority queues that do not match, or specify the wrong registers. To solve this problem, operand names are isolated from the sender by being accessible only to the receiver.

To provide such isolation, the receiver could decode the operation to find the receiver operands or simply interrupt the host processor (which incurs significant overhead). To allow sufficient flexibility in operand specification, this first alternative requires programmable control at the receiver which increases complexity. Nevertheless, a fairly simple implementation of this alternative is presented later. To support the very common case of indirection with postincrement, the following special case is added to the previous primitives:

effaddr = <addreg_i>;
addreg_i <- <addreg_i> + <addreg_i+1>.

To use this special indirect addressing mode, the sender only specifies an index i in the operand. The receiver uses address register i for indirection and automatically uses the [illegible].

The connection table 140, endpoint table 160, and address registers 170 [illegible] from memory in the network interface which typically is either a static random access memory (SRAM) or a dynamic random access memory (DRAM).

A conventional translation look-aside buffer (TLB) 180 is used to map an address indicated by the message into a physical address (PA). The TLB is preferably a fully associative cache. The TLB matches on bits identifying the endpoint as well as on the virtual address (VA) since multiple endpoints may have the same VA. The TLB also stores an indication of the access rights to the physical address.

The connection table 140, endpoint table 160, address registers 170 and TLB 180 are interconnected in the following manner. The endpoint number 146 output from the connection table 142 is used as the input to the endpoint table 160. The base 148 and bounds information 149 from the connection table 140 are fed respectively to an adder 156 and a comparator 158. The adder also receives the index from the received ATM cell (128 in FIG. 7) and its output is also fed to the comparator 158. The comparator acts as a filter through which a valid address to the address registers 170 is provided, otherwise an error trap occurs.

The endpoint table 160 has its outputs for the base 164 and bounds 166 connected respectively to an adder 172 and comparator 174. The adder 172 also receives an offset from the received ATM cell (in operand field 126 in FIG. 7). A multiplexer 176 receives the output of the adder 172, the base 164 from the endpoint table 160 and the offset from the received ATM cell. The output of the multiplexer 176 is applied to an input of an arithmetic logic unit (ALU) 178. The ALU also receives as another input a value read from the address registers 170. The outputs of the ALU 178 are a condition code which connects to the receive controller (228 in FIG. 8) and a result which is connected to a demultiplexer 179 and the address registers 170. The demultiplexer 179 also receives as another input the output of adder 172. The output of the demultiplexer 179 is applied to another input of the comparator 174. The output of the comparator 174 is either an error or an address within the endpoint range which is then input to the TLB 180.

The receive controller 228 is used to control the multiplexer 176, ALU 178, demultiplexer 179, and read and write of the address registers 170 in accordance with the state information 150 from the connection table 140, the opcode/control information from the received ATM cell (130 and 122, respectively, in FIG. 7), and the condition code from the ALU 178. There is great flexibility within the receive controller 228 with respect to features supported and the implementation of these features. If only the basic five addressing modes described above are implemented, and because these modes are very simple, the system implementing them can be hardwired via a finite state machine. FIG. 10 shows a sketch of such a table-driven implementation. The sender controller 234 could be realized in a manner analogous to the receive controller 228.
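
The address register protection described above (read, write and indirect access, with indirect access permitting use of a value the sender cannot see) can be illustrated with a minimal software sketch. All class and function names here are hypothetical, and the hardware would implement these checks with the protection bits 144 rather than with exceptions in software.

```python
from dataclasses import dataclass
from enum import Flag, auto


class Access(Flag):
    """Assumed encoding of the three access types named in the text."""
    NONE = 0
    READ = auto()
    WRITE = auto()
    INDIRECT = auto()


class ProtectionError(Exception):
    """Stands in for the exception raised on a disallowed access."""


@dataclass
class AddressRegister:
    value: int
    allowed: Access  # models protection bits 144


def read_register(reg: AddressRegister) -> int:
    # A direct read exposes the value, so it needs READ permission.
    if Access.READ not in reg.allowed:
        raise ProtectionError("read not permitted")
    return reg.value


def indirect_postincrement(reg: AddressRegister, amount: int) -> int:
    # Indirect use needs only INDIRECT permission; the returned effective
    # address is consumed by the receiver and never sent back to the sender.
    if Access.INDIRECT not in reg.allowed:
        raise ProtectionError("indirect access not permitted")
    effective_address = reg.value
    reg.value += amount
    return effective_address
```

A register holding a queue pointer could be given only `Access.INDIRECT`, so a sender can append through it without being able to read or overwrite the pointer.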
In FIG. 10, the condition code, state and opcode are
respectively inputs 184, 186 and 188 which are used to index
a table 190. The outputs of the table are control signals 192
to the multiplexer, demultiplexer, ALU and latches (not
shown), new state information 194 and a mask 196. The use
of these outputs will be described in more detail below. In
FIG. 9, these outputs are all subsumed by the line labelled
control from the receive controller to the operation logic.
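
A table-driven controller of this kind can be sketched as a lookup keyed by condition code, state and opcode. The opcode names, state values, control signal strings and table contents below are invented for illustration; the text does not specify them.

```python
from typing import NamedTuple


class TableOutput(NamedTuple):
    control_signals: str  # stands in for control signals 192
    new_state: int        # new state information 194
    mask: int             # mask 196


# Hypothetical encodings, chosen only to make the example concrete.
IDLE, IN_MESSAGE = 0, 1
CC_OK, CC_FAIL = 0, 1

CONTROL_TABLE = {
    (CC_OK, IDLE, "WRITE_DIRECT"):
        TableOutput("mux=adder alu=pass store", IDLE, 0xFF),
    (CC_OK, IDLE, "WRITE_INDIRECT_POSTINC"):
        TableOutput("mux=offset alu=add store regwrite", IDLE, 0xFF),
    (CC_OK, IDLE, "MULTI_CELL_START"):
        TableOutput("mux=adder alu=pass store", IN_MESSAGE, 0xFF),
    (CC_FAIL, IDLE, "WRITE_DIRECT"):
        TableOutput("trap", IDLE, 0x00),
}


def controller_step(condition_code: int, state: int, opcode: str) -> TableOutput:
    """Index the table with inputs corresponding to 184, 186 and 188.

    Combinations missing from the table are treated as an error trap that
    stores nothing and leaves the state unchanged (an assumption).
    """
    key = (condition_code, state, opcode)
    return CONTROL_TABLE.get(key, TableOutput("trap", state, 0x00))
```

Adding a new operation then amounts to adding rows to the table rather than changing the lookup logic.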

Although not shown in FIG. 9, there are data paths to the
connection table 140, endpoint table 160, address registers
170, and TLB 180 so that the host processor can read and
write their contents. This functionality is provided so the
operating system can maintain the tables and TLB and so
applications can access and manipulate the address registers.
Direct application access to the address registers poses a
protection problem though. The easiest solution is to deny
direct access and force applications to access the address
registers only via operating system calls. However, this
solution makes address register access expensive. Another
solution is to map the address registers into the application
virtual address (VA) space. That is, address registers of each
connection are mapped to a different physical address range.
Then some additional circuitry could extract the connection
number from this physical address and use the address
register base 148 and bounds 149 in the connection table 140
to access the appropriate region of the address registers 170.
With this solution an operating system call is required only
to establish the mapping. Thereafter an application can
access address registers using the same address register base
and bounds circuitry used by incoming messages.

Although the endpoint base has been described as a VA,
it could also be a physical address (PA). In fact, if the base
is a PA, the mapping can be simplified to merely a bounds
check, however incurring two constraints. The first con-
straint is that an endpoint larger than a page is allocated on
wired down consecutive physical pages. The second con-
straint is that the address registers can only contain relative
addresses or PAs, otherwise VA to PA mapping is still
required. The first constraint restrains use of host memory,
especially if dynamically sized endpoints are desired. The
second constraint means that either an application cannot
use full address pointers for indirection or the operating
system is invoked explicitly to map a pointer to a PA before
storing the pointer in an address register.

The functioning of operation logic 230 will now be
described. A connection number (conn#) is derived from the
message which is used to index into the connection table
140. Protection is provided by ensuring that the connection
number is within the table size, i.e., is a valid connection
number, for example by using the comparator 152.

The index value from the ATM cell is added to the base
148 from the connection table 140 using adder 156. This
sum is then compared with the bounds 149 by comparator
158. If the sum is within the bounds, it is provided as an
address to the address registers 170. Otherwise, an error trap
occurs. A value from the address register is thus obtained
and may be applied to the ALU 178.

The endpoint number obtained from the connection table
140 is used to retrieve the base 164 which is added to the
offset value from the received ATM cell using adder 172.
This sum is applied to the multiplexer 176 and demultiplexer
179. The multiplexer is controlled so as to select one of the
base 164, the sum from the adder 172 and the offset to be
applied to the ALU 178. The ALU is controlled to perform
operations on the value received from the address registers
170 and the output of the multiplexer to provide a condition
code and an address. The operations which can be performed
by the ALU 178 include, but are not limited to, operations
on address register contents, such as adding for post-
increment mode, logical operations and comparison, e.g., for
conditional interrupts. The ALU can be used for various
other operations on the address register contents. Unless
conditional interrupts and operations on the address registers
are to be restricted to addressing modes that do not use the
address registers, the address register memory should be
multi-ported or clocked faster than the rest of the operation
logic 230.

The result provided by the ALU 178 may be applied to the 
address registers 170. The demultiplexer 179 is controlled so 
as to select one of the sum from the adder 172 and the result 
from the ALU 178. The result is compared with the bounds 
166 for the endpoint using comparator 174 to add to the 
protection described above. 
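
The successive protection checks just described can be summarized in a small software model. This is a sketch only: it assumes that each bounds field holds the last valid address of its region, it covers only the direct addressing path, and all names other than the reference numerals in the comments are hypothetical.

```python
from dataclasses import dataclass


class ErrorTrap(Exception):
    """Raised where the text says an error trap occurs."""


@dataclass
class ConnectionEntry:
    endpoint: int    # endpoint number 146
    reg_base: int    # base 148 of the register window
    reg_bounds: int  # bounds 149, assumed to be the last valid register address


@dataclass
class EndpointEntry:
    base: int        # base 164
    bounds: int      # bounds 166, assumed to be the last valid address


def check_connection(conn_no: int, table_size: int) -> int:
    # Comparator 152: the connection number must index a valid entry.
    if not 0 <= conn_no < table_size:
        raise ErrorTrap("invalid connection number")
    return conn_no


def register_address(conn: ConnectionEntry, index: int) -> int:
    addr = conn.reg_base + index             # adder 156
    if index < 0 or addr > conn.reg_bounds:  # comparator 158
        raise ErrorTrap("index outside register window")
    return addr


def direct_address(ep: EndpointEntry, offset: int) -> int:
    addr = ep.base + offset                  # adder 172
    if offset < 0 or addr > ep.bounds:       # comparator 174
        raise ErrorTrap("offset outside endpoint")
    return addr  # would next be presented to the TLB 180
```

Each check narrows what a sender can touch: first to a valid connection, then to that connection's register window, then to the endpoint's address range.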

The final line of protection is finding a valid mapping 
entry in the TLB 180 for the address output from the 
comparator 174. A TLB miss, either because the desired 
address mapping is not present in the TLB or because of 
inadequate access rights, causes an exception to be delivered 
to the operating system. While handling a TLB miss via an 
exception to the host processor uses little hardware, it has 
the disadvantage of blocking further processing of incoming 
cells while the miss is serviced. These incoming cells 
preferably are throttled and buffered but may also be dis- 
carded. Alternatively the interface may either use hardware 
to service TLB misses or have dedicated mapping per 
endpoint to reduce or eliminate misses. The former choice is 
hardware intensive, inflexible, and still will delay cell pro- 
cessing because of delays in accessing the host memory for 
mapping information. The latter choice might be workable if 
the endpoint table is expanded to contain one or two 
mapping entries per endpoint. However, such expansion 
leads to a space and performance tradeoff. 

The receive controller 228 decodes the opcode field 129 
and uses the condition code from the ALU 178 and the state 
information 150 from the connection table 140 to determine 
what to do with the incoming ATM cell. The control signals 
output by the receive controller 228 are determined accord- 
ing to the opcode and condition code and are used to effect 
a desired addressing mode and to store the data. The mask
output 196 (see FIG. 10) selects which elements, e.g., using
a 4-byte granularity, of the data are actually stored in the
receiver memory. For a 32 byte data block, the mask can be 
taken directly from 4 bits of the opcode field. 
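
The effect of the mask on a store can be sketched as follows. The mapping of mask bit i to the i-th 4-byte word is an assumption made for illustration, as is the function name; the text does not state how mask bits correspond to data elements.

```python
def apply_mask(memory: bytearray, address: int, data: bytes,
               mask: int, word_bytes: int = 4) -> None:
    """Store only the words of `data` whose mask bit is set.

    Bit i of `mask` is assumed to select the i-th word of the data block.
    Words whose bit is clear leave receiver memory unchanged.
    """
    for i in range(len(data) // word_bytes):
        if mask & (1 << i):
            start = i * word_bytes
            memory[address + start:address + start + word_bytes] = \
                data[start:start + word_bytes]
```

With a mask of 0b101, for example, the first and third words of the block are written and the rest of the destination is untouched.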

The connection state 150 records connection addressing 
information across cells in a multiple cell message. The first 
cell in a multiple cell message is a "control" cell that chooses 
the addressing mode and specifies the offset and index. This 
information is stored in the state field of the connection table 
140 so that subsequent cells for the same connection can 
omit the control information and thus carry more data. For 
example, the first cell could have 32 bytes of data and 
subsequent cells could have 48 bytes of data (perhaps in 
AAL5 format). Each subsequent cell uses the control infor- 
mation stored in the connection table state field. The end of 
such a data cell sequence could be indicated either by storing 
a cell count in the state field, or by using the standard AAL5 
format wherein the last cell in such a sequence carries the 
length and a CRC. Every cell is checked until a correct CRC 
is found. Assuming a multi-cell message terminates with a
CRC, this scheme could transfer N data blocks of 32 bytes
each in ⌈2/3 (N−1)⌉+2 cells for N>1, which asymptotically
achieves 3/2 the bandwidth compared to sending one 32 byte
data block per cell.
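
The cell count can be evaluated numerically. This sketch assumes the bracketed term is rounded up to a whole number of cells, which is an interpretation of the printed formula; the function name is hypothetical.

```python
def cells_for_blocks(n_blocks: int) -> int:
    """Cells needed for N 32-byte blocks in a multi-cell message, N > 1.

    Assumes the count is ceil(2/3 * (N - 1)) + 2, with the bracketed term
    rounded up. Integer arithmetic avoids floating point rounding.
    """
    if n_blocks <= 1:
        raise ValueError("formula applies for N > 1")
    return (2 * (n_blocks - 1) + 2) // 3 + 2


for n in (2, 4, 10, 100, 1000):
    cells = cells_for_blocks(n)
    print(f"N={n}: {cells} cells, {n / cells:.3f} blocks per cell")
```

As N grows, the number of blocks carried per cell approaches 3/2, compared with exactly one block per cell for single-cell remote writes.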

Many other functions as described above can be imple- 
mented by adding to the opcodes interpreted by the receive 
controller 228. 
Any error trap to the operating system may be handled by either discarding the cell or by inserting the cell in an error queue and signaling an exception to the operating system.

Having now described the receive portion of the front end, the send portion will now be described. Both control information and data are provided to send a cell. The control information comes from a set of send registers 236 which each connection has for sending cells. The first three registers in this set store the control information, i.e., the opcode, operand, and index, for the cell. The data comes either from the endpoint associated with the connection or from a [illegible] which is useful for [illegible].

The set of send registers 236 for each connection also contains a status register, a mode register and a "go" register. The status register contains the number of cells sent. If initialized to 1, it will be set to 0 when the interface actually sends the cell. This feature is useful to serve as an acknowledgment that a sent cell has actually left the interface since a cell might not be sent right away due to flow control or an exception. The mode register enables two variants of sending which will be described below. The go register is used to actually cause a cell to be sent as will be described below.

As discussed previously, numerous implementation options are possible for the front end. Several different embodiments are presented here. A first embodiment of the front end architecture in FIG. 9 uses minimal hardware. Address register operations are restricted to at most one address register read and write operation per cell. Endpoint pages are pinned in physical memory while the buffer is active. Remote reads are handled by the host processor, via an interrupt. The first restriction means that any combination of the primitive operations listed above acting on address registers acts on the same address register. Also, any binary operations obtain one argument from the message operand. Thus the postincrement amount for indirect addressing operations is specified by the message, which does not completely isolate the receiver state from the sender. Alternatively, the desired functionality can be composed from a sequence of messages, e.g., the postincrement amount could be specified via a register add in a following message. Although pinning the endpoint pages prevents large endpoints and even restricts the number of small endpoints, these restrictions lead to a simple architecture.

The receive side of this embodiment is as described above, except that, because the endpoint pages are pinned in this architecture, the endpoint table 160 contains the physical address of the base of an endpoint. However, since physical pages are not necessarily allocated contiguously, a TLB "cache" of address translation pairs is used to map the sum of an endpoint base address and offset, labeled "VA" in FIG. 9, to the appropriate physical address. For endpoints of one page or less in size, the PA may be stored directly in the base field 164 of the endpoint table entry 162. A direct mapping bit is added to all endpoint table entries 162 to control interpretation of the base field. Additional mapping entries could be added to the endpoint table 160 to accommodate larger endpoints.

The operation of the send side is as follows. First, the send endpoints are restricted to be integral page sizes. To authorize sending to a particular connection without a kernel call, each outgoing connection from an endpoint has a unique virtual mapped command area at a fixed offset (in the high order virtual address bits) from the endpoint virtual pages and the same size as the endpoint. These connection command pages map uncached to the network interface. The send registers 236 for the connection are mapped into the page just below the base of the connection command region.

To send a block of data at an offset from the base of the source endpoint to the network via connection C, a write operation is performed to the location at the same offset from connection C's command base. The value written is ignored. After mapping, the low order physical address bits are the physical address (or an offset as described later) of the data in the endpoint to send and the high order physical address bits [illegible] the connection number. The send controller extracts the control information from the send registers of the connection and reads the data block to [illegible].

With some host processors there may not be sufficient bits to encode the physical address of the data block and the connection number, as well as the send controller 234 address, into a physical address. Two alternatives are possible in such systems. The first alternative is to replace the physical address of the data block with the offset from the endpoint base. This requires the base address of an endpoint to be accessible to the send controller 234 and endpoint physical pages preferably to be stored contiguously.

The second alternative is to use a right-shifted version of the physical address of the data block. A very convenient choice is to shift right by five bits. Since data blocks are addressed at 32 byte granularity, the least significant five bits of the physical address are unused anyway, assuming a byte-addressed machine. This shifting frees five bits in the physical address for encoding the connection number and send controller 234 address. However, this right shifting has repercussions. Since the page offset is not changed by virtual memory mapping, the endpoint command virtual address and the data block physical address are linked by the following constraint: the five most significant bits in the endpoint command virtual page offset correspond to the five least significant bits of the data block physical page number. This constraint has three consequences. First, the endpoint command region is a factor of 32 smaller than the endpoint: each entry in the command region now maps to the base of a data block. Consequently, each block of contiguous 32 pages in the endpoint has the same memory protection. Second, endpoints are multiples of 32 pages in size. This constraint can be relaxed if the send controller 234 can access the endpoint base and size. Third, endpoints are mapped to contiguous chunks of 32 pages aligned with a 32 page memory boundary.

The mode register enables two variants of the send procedure. In the first, the operand is taken from the value actually written to the connection command region. The second causes an exception if an attempt is made to send when the status field is non-zero. The "go" register is not used in this embodiment.

Any state operated on by the cell, e.g. write to address registers, is not updated until after the point of the last possible exception point, a TLB miss, for that cell. The cell is retained in the input FIFO and processing of further cells from that connection is blocked until re-enabled by the host processor; cells are only removed from the input FIFO when a cell "commits" after the last exception point.

In the second embodiment, the three restrictions of the first embodiment are removed to obtain three major enhancements: no pinning of endpoint pages, remote reads without interrupting the host processor, and a richer set of register and conditional operations. To support this first enhancement, the endpoint table 160 contains virtual addresses and the TLB maps from virtual addresses to physical addresses. This enhancement also introduces a new
category of exceptions: page faults from references to paged
out endpoint pages. These are treated as another class of
exceptions and are serviced by the host processor, which
maintains the main virtual mapping tables. The operating
system is now responsible for keeping the mapping information
consistent with the host memory state.

To handle remote reads without interrupting the host
processor, the address mapping for the read data is computed.
To do so, the receiver side, which does mapping, is
reused for sending, i.e., the operation logic 230 is multiplexed
between the receiver side and the sender side. In this
embodiment, in fact, the operation logic 230 is used for all
sends and not just sends requested by remote reads. Consequently
it is no longer necessary to use the virtually mapped
command pages for protection. However, in this
embodiment, the virtually mapped send registers are
retained. To send a block of data at an offset from the base
of the source endpoint, the offset is written into the "go" register
for the appropriate connection attached to that endpoint.
This write causes a data block at the offset, 32 byte block
aligned, from the endpoint base to be read and composed
into a cell with the control information in the opcode,
operand, and index registers and sent via the associated
connection. A multiple send mode is also added in which the
status register is set to the number of cells to send starting
at the specified offset.

In this embodiment, three primitive operations can be
executed per message: address generation, register
operations, and conditionals. A main opcode controls the
selection and ordering of the primitive operations. Example
opcodes are read, read multiple, write, write multiple, and
software exception which causes an interrupt to the host
processor. The instruction format allows up to three different
register operands to be named in addition to an immediate
operand. To accommodate all the accesses to the address
register 170, the register operations are all triple clocked.
Due to potential side effects, state recovery after exceptions
is more complicated.

It is sometimes preferable to have greater flexibility in
control functionality and cell interpretation than may be
provided by the first two embodiments. For example, operations
for locking or swap-and-compare may be desired.
Also, it might be useful to customize the cell level protocol
for certain applications. Ultimately, flexibility could be
available in the form of full programmability, which is
always a tradeoff between complexity and cost. Two ways to
add flexibility to the previous embodiments are to make the
receive and send controllers 228 and 234 writable, and to
add a programmable finite state machine for cell interpretation.
The first non-header word of an ATM cell may index
into a writable control memory that interprets the remaining
fields of the cell and sets all the control signals. This control
memory might have a number of common cell interpretations
hardwired in and a number of programmable ones,
perhaps even connection-dependent interpretations.
However, a micro-processor should also be considered for
this purpose.

Thus, a third embodiment uses a conventional
microprocessor, in effect a communication co-processor, for
the entire front end. The microprocessor simplifies the
hardware since all of the internal control logic of the
microprocessor is leveraged and its own TLB can be used as
the TLB 180. The endpoint number may be incorporated in
the high address bits of the VA. Also, fully programmable
cell interpretation and control may be obtained.

Although the third embodiment provides flexibility, even
at 155 Mbps, it may be difficult to use a microprocessor for
everything. It makes more sense to use a microprocessor to
interpret high level operations. For example, a microprocessor
could be particularly effective for relieving the host
processor of responsibility for read operations. Such a
communications co-processor could still benefit from hardwired
address generation and mapping circuitry to help
reduce the computation burden.

A [illegible] embodiment [illegible] programmable cell
inter[illegible] hybrid of hardware control of the first
two embodiments and the microprocessor control of the
third embodiment. [illegible] operations [illegible]
by a host processor or a co-processor. In this
embodiment, the various functional units required are preferably
implemented on a programmable gate array, using
DRAM for the various tables. This basic functionality may
even function as a front-end for a co-processor so that the
co-processor would have a reduced load. This can be very
helpful to realize both high performance and flexibility at
high data rates like 622 Mbps or 1.2 Gbps.

Where possible, these embodiments could easily be pipelined
for high performance. However, at 155 Mbps data
rates, bytes of data arrive at about 50 nsec intervals, which
should be long enough to complete the address and control
setup. It may be necessary to buffer the data for about a word
or so to satisfy memory hierarchy access times. At 622 Mbps
only a few pipeline stages should be necessary.

As described above, any of these embodiments for the
front end 210 is also connected to a "back end" 214 of the
network interface for connecting the front end to the
memory of its host. Three embodiments for such a back end
will now be described.

In a first embodiment the front end may connect directly
to the main memory bus. A cache controller snoops this bus
to ensure coherency.

Alternatively, while a direct memory connection is attractive
performance-wise, it is only an option to computer
builders. An I/O bus interface such as the PCI bus would be
accessible to a far greater market. The disadvantage of most
I/O buses, including the PCI bus, is delay in gaining control
of the bus due to the activity of other bus devices; thus some
degree of on-board buffering is used, which adds to latency.

As another alternative, a direct cache interface could be
used which does not require processor modifications. In this
embodiment the network connects directly to external processor
cache. This direct coupling of the network to a cache
may reduce the copying of message data. To avoid unduly
diluting this cache with network traffic, and thus negatively
impacting the performance of the processor to which it
connects, another embodiment couples the network to a
separate message cache 252 as shown in FIG. 11.

In FIG. 11, a message cache 252 is connected to the
micro-processor 250 via a bus 256. The message cache 252
is also connected to data cache 254 and main (or host)
memory (not shown) via a bus 258. A mapping unit 260
connects to the message cache via bus 262 and to the
network.

This message cache 252 is fully integrated into the
memory hierarchy (as shown by connections 256 and 258),
so there is no need to copy data from a message buffer to the
memory hierarchy before a process can access the data. The
interface may be implemented at the secondary cache level,
and thus no expensive, special purpose, or custom processor
modifications are required. By restricting the data size of
messages to be equal to the cache block size, e.g., 32 bytes,
cache blocks can be updated atomically, eliminating complicated
and slow circuitry for updating partial cache blocks.

A problem in this direct cache interface in FIG. 11 is
maintaining coherency between the message cache 252 and
cache 254 using a low overhead mechanism. To solve this
problem, the magnitude of the incoherency problem is
reduced by allowing only the network to write into the
message cache 252. Then each write into the message cache
252 checks for and invalidates any blocks in the data cache
254 with a matching tag. The impact of this checking on the
data cache 254 is minimized by performing the checks on a
shadow copy of the data cache tags. Further details on such
a direct cache interface may be found in a copending U.S.
patent application entitled "Low Latency Network
Interface", filed Nov. 16, 1993 by Randy B. Osborne.

Numerous modifications and variations to the embodiments
of the network interface can be made. For example,
exception handling and flow control can be integrated in the
front end, for the following reasons. Exceptions caused by
error traps, protection violations, unimplemented
operations, TLB misses, and page faults, e.g., host memory
page faults, and possibly connection and endpoint table page
faults, can slow cell processing in this system and thus lead
to a flow control problem. Cell processing can also be
blocked to ensure atomicity during host processor accesses
to this system's memory structures. For error traps, this
problem can be avoided entirely by immediately discarding
the offending cells, or it can be pushed into a buffer overflow
problem by putting such cells on an exception queue for
examination at the convenience of the operating system. Of
course, discarding the offending cells will not work for the
other exception types because even if the offending cell is
discarded and an implicit or explicit "retry" message is
returned to the sender, the exception condition still has to be
repaired before forward progress can be made. In the
meantime, incoming cells are discarded, buffered, or
throttled.

However, this only applies to incoming cells belonging to
the same connection affected by the exception condition.
Cells belonging to other connections can be processed once
the exception condition is saved. Cells belonging to the
exception incurred connection cannot be processed, even if
they are not affected by the exception, i.e., they do not cause
a TLB miss or page fault, since some applications may
depend on the guarantee of sender ordering of ATM cells.
Since this ordering guarantee is per channel, which maps to
connection in this system, it is acceptable to continue
processing cells which belong to different connections but
share the same endpoint as an exception incurred cell. Of
course, these cells may cause an exception.

A strategy for dealing with exceptions will
now be discussed. First, the offending cell is removed from
the front end as quickly as possible. Cells causing error traps
can be discarded or queued. Other cells are retained in the
input FIFO. The connection is then marked as exceptioned.
Flow control is invoked next to throttle senders transmitting
further cells for that connection. In the meantime, any
further cells which arrive for that connection are buffered,
whereas other connections may continue processing cells.
Only the first step is necessary for error traps. Global
"exceptions", like blocking the system during host processor
accesses, require throttling and buffering across all connections.
The throttling and buffering are compatible with the
credit-based flow control schemes.

Hybrid address mapping could be used as an alternative to
a large global TLB for mapping. That is, each endpoint
buffer table entry could contain one or two mappings and the
TLB could contain the rest. This hybrid mapping is a
generalization of the idea, discussed above in connection
with the first front end embodiment, of inserting the mapping
for single page endpoints directly in the endpoint table.
The mappings in the endpoint table entries could be managed
by the application so as to always contain mappings
likely to be used soon.

Locality in the connection and endpoint tables can also be
managed. That is, with many connections and large
endpoints, the respective tables might be large in size. One
idea to reduce the table sizes required is to exploit locality
by essentially "paging" out entries from the tables. On a
page fault when accessing these tables, the interface can
remove the page faulting cell from the cell processing
pipeline and buffer and flow control further cells for the page
faulted connection as described above. The host processor
can then perform the actual "paging" of the tables, restore
state, and restart cell processing for the faulted connection.

Structures can also be added to provide fair network
access and to prevent network deadlock. That is, processes
are prevented from interfering with each other either by
blocking the network or by congesting the network and
inducing deadlock in other processes. To ensure fairness,
some form of admission control could be provided that
limits the duration that one connection can send to the
network if other connections have pending, non-flow controlled
traffic. For performance reasons it is also a good idea
to give priority to operating system traffic. Preventing deadlock
requires several steps. First, each connection has independent
flow control. Independent flow control per VCI/VPI
is fairly standard in ATM networks and interfaces. Second,
any global exceptions that block processing of all cells have
bounded duration. Third, cells requiring a response from the
network are removed even though the reply connection may
be flow controlled. This removal is made possible by reserving
some buffer capacity for reply traffic, such as by allocating
a separate VCI/VPI with associated flow control
buffers, per connection just for reply traffic. Admission
control should favor reply traffic over new traffic.

To ensure at least the operating system can always make
progress, it should also have its own connection. Any pages
that the operating system might use, such as in the connection
and endpoint tables and address registers, should be
locked down to prevent unbounded delay due to page
thrashing.

Global address registers may also be used in the front end.
In the embodiments described above, address registers are
currently private to each connection. This could make it
inconvenient for several different connections to the same
endpoint to share a common queue. One way to solve this
problem is to add a number of address registers that are
global to the endpoint rather than being strictly local to
connections. The endpoint table could contain base and
bounds to such global registers in the same memory as the
local registers.

Fragmentation buffers can also be provided in the front
end. That is, for multi-cell messages, matching the 48 byte
payload of ATM AAL5 to power of 2 memory and cache
block sizes leads to fragmentation problems. For example,
with 32 byte blocks, there will be 16 byte fragments. Both
the sender and the receiver may keep fragmentation buffers
to fragment and then reassemble 32 byte blocks, or whatever
the memory and cache block size is. However, it is highly
likely that it will be possible to leverage whatever segmentation
and reassembly support there already is in a
standard high bandwidth ATM interface for this fragmentation
purpose.

The sender and receiver addresses also may not be aligned
with respect to a memory or cache block. Assuming that
(mis)alignment is the same at the sender and receiver, an
aligned block can be sent and the portions that should not be
received can be merely masked out. This may not lead to the
best utilization of cells, but such misalignment is likely to be 
rare. Another possible performance penalty is that some 
architectures may restrict subblock addressing, so it may be 
necessary to read a whole block in order to write it selec- 
tively under a mask. The masking, or scatter, operation is 
done at the receiver. The sender may also perform masking, 
but without a gather capability it is optional: it might be 
useful for security to prevent certain fields from going out on 
the network, but it does not improve efficiency. 
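The receiver-side masking described above can be illustrated with a minimal sketch. It assumes a 32 byte block and a mask with one bit per byte; the names `BLOCK_SIZE` and `masked_deposit` and the bitmask encoding are hypothetical and do not come from the specification. The whole block is read before it is written back, mirroring an architecture that restricts subblock addressing.

```python
BLOCK_SIZE = 32  # assumed block size, matching the 32 byte example in the text


def masked_deposit(memory, block_addr, payload, mask):
    """Deposit one block, keeping only the bytes whose mask bit is set.

    Bit i of `mask` covers byte i of the block (a hypothetical encoding).
    """
    if block_addr % BLOCK_SIZE != 0:
        raise ValueError("block address must be block aligned")
    if len(payload) != BLOCK_SIZE:
        raise ValueError("payload must be exactly one block")
    # Read the whole block first, as required when subblock writes are restricted.
    block = bytearray(memory[block_addr:block_addr + BLOCK_SIZE])
    for i in range(BLOCK_SIZE):
        if (mask >> i) & 1:
            block[i] = payload[i]
    # Write the merged block back in one step.
    memory[block_addr:block_addr + BLOCK_SIZE] = block
```

In this sketch a mask of `0b1111 << 4` would deposit only bytes 4 through 7 of the received block and leave the rest of host memory unchanged.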

In some cases the sending address may be unaligned both
with respect to block boundaries and with respect to the
receiver address. Shifting, in addition to masking, is required
to deal with this problem. To solve this problem, alignment
shifters can be added.
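As a rough illustration of shifting combined with masking, the sketch below rotates a received block so that data starting at the sender's offset lands at the receiver's offset, and builds the matching byte mask. It assumes a 32 byte block, data that fits within one block on both sides, and a one-bit-per-byte mask; `align_block` and its parameters are hypothetical names, not part of the specification.

```python
BLOCK_SIZE = 32  # assumed block size


def align_block(block, src_off, dst_off, length):
    """Rotate `block` so bytes at src_off.. land at dst_off.., and build a mask.

    Returns (rotated_block, mask), where bit i of mask marks byte i as valid.
    """
    if len(block) != BLOCK_SIZE:
        raise ValueError("expected exactly one block")
    if src_off + length > BLOCK_SIZE or dst_off + length > BLOCK_SIZE:
        raise ValueError("data must fit inside one block at both ends")
    shift = (dst_off - src_off) % BLOCK_SIZE
    # Byte at index i of the input moves to index (i + shift) % BLOCK_SIZE.
    rotated = bytes(block[(i - shift) % BLOCK_SIZE] for i in range(BLOCK_SIZE))
    mask = ((1 << length) - 1) << dst_off
    return rotated, mask
```

The rotated block and mask could then be handed to a masked write at the receiver, so that only the `length` valid bytes are deposited.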

A prefetch queue could also be used in the front end 
because prefetching is useful for hiding latency. The system 
described above actually already supports prefetch queues. 
To provide complete prefetching, a processor that can initiate
prefetches, such as the DEC Alpha, may be used together with
some hardware that can decode prefetch operations from this
processor and turn them into real fetches. It is probably not
possible to put such hardware on an I/O bus-based interface,
but it could certainly be done, as it was done in the Cray Tera
3D, in designs where the network interface connects directly
to the data bus. Provided that a fetch can be initiated in some
way, the return address of the fetch is simply listed as the tail
of a queue. The conditional interrupt scheme could even be 
used to indicate when the queue is nonempty. 
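A minimal sketch of such a prefetch queue follows, under stated assumptions: replies are deposited at the tail with a post-increment, and the conditional interrupt is modeled as a boolean returned when the queue goes from empty to nonempty. The class `PrefetchQueue`, its fixed slot count, and the notify-on-first-entry policy are hypothetical choices made for illustration only.

```python
class PrefetchQueue:
    """Circular queue whose tail serves as the return address for fetch replies."""

    def __init__(self, slots):
        self.buf = [None] * slots
        self.head = 0   # next entry the host will consume
        self.tail = 0   # where the next fetch reply is deposited
        self.count = 0

    def deposit(self, data):
        """Deposit a reply at the tail; return True if the host should be notified."""
        if self.count == len(self.buf):
            raise OverflowError("prefetch queue is full")
        was_empty = self.count == 0
        self.buf[self.tail] = data
        self.tail = (self.tail + 1) % len(self.buf)  # post-increment the tail
        self.count += 1
        # Assumed policy: signal only on the empty-to-nonempty transition.
        return was_empty

    def consume(self):
        """Remove and return the oldest reply."""
        if self.count == 0:
            raise IndexError("prefetch queue is empty")
        data = self.buf[self.head]
        self.head = (self.head + 1) % len(self.buf)
        self.count -= 1
        return data
```

Signalling only on the first deposit keeps the host from being interrupted for every reply while it is already draining the queue.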

Also, in some applications, it may be desirable to provide
greater isolation between the sender and the receiver.
In the direct deposit model presented above, the sender places
an instruction and operands directly into a message. Oper- 
ands may be immediate, or reference some state (in the 
address registers) in the receiver. The receiver executes the 
instruction. As described earlier, access restrictions on the 
address registers provide some degree of protection and 
isolation. However, there still may be inadequate isolation 
for some applications due to the fact that a sender still has 
to name all the instruction operands, even if it cannot access 
those operands. One way to solve this problem is simply to 
interrupt the host processor to perform actions requiring 
isolation, as in receiver-based addressing systems. 

Another way to solve this problem is to allow the direct 
instruction (operation) in a message to be replaced by a 
pointer to an instruction in the receiver (an instruction 
pointer). The instruction may directly reference receiver
operands, e.g., in the address registers, without knowledge of
the sender. Messages can still provide immediate operands 
and name receiver operands, though the receiver may choose 
not to use these operands, as may be necessary to maintain 
isolation. 

One simple embodiment of this solution will now be 
presented. The principal modification to the receive side as
described to this point is to add an instruction and operand 
buffer area as shown in FIG. 13. The connection table 149 
now contains base and bounds entries 300 and 302, similar 
to that for address registers, for access to an instruction 
memory 304. To keep the scheme simple, each instruction is 
composed of an operation and operands in exactly the same 
format as in an ATM cell as described earlier. The operation 
controls from which location (the instruction memory 304
in the receiver or the message) operands are taken. Protection
bits, similar to those for the address registers 170, allow
the receiver to control which instructions serve as entry 
points to the sender. 
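The lookup just described can be sketched as follows. This is an illustrative model only: the `Instruction` fields, the interpretation of `bounds` as a count of instructions reachable from `base`, and the function name `resolve_instruction` are assumptions, not details taken from the specification.

```python
from dataclasses import dataclass


@dataclass
class Instruction:
    opcode: str
    operand: int               # operand stored at the receiver
    entry_point: bool          # protection bit: may a sender start here?
    use_message_operand: bool  # selects the message operand over the stored one


def resolve_instruction(instr_memory, base, bounds, pointer, msg_operand):
    """Resolve a sender-supplied instruction pointer to (opcode, operand)."""
    # Base and bounds check, analogous to the one for address registers.
    if not 0 <= pointer < bounds:
        raise PermissionError("instruction pointer outside connection bounds")
    instr = instr_memory[base + pointer]
    # Only instructions marked as entry points may be named by the sender.
    if not instr.entry_point:
        raise PermissionError("instruction is not an entry point")
    operand = msg_operand if instr.use_message_operand else instr.operand
    return instr.opcode, operand
```

Because the operand source is chosen by the stored instruction rather than by the message, the receiver can ignore sender-supplied operands wherever isolation requires it.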

There are at least three enhancements to this scheme. The first
is global instructions. Many connections are likely to share
the same operations, though on different operands. To
accommodate this expectation, a capability is added for
global instructions. These are instructions that are globally
accessible across all connections. They are constrained to
operate only on sender-supplied operands to minimize the
difficulty of operating on different receiver operands.

The second enhancement is separate instruction and oper- 
and memory. Instruction and operand memory could be 
separated at the cost of complexity to save memory storage. 

The third enhancement is providing multiple instructions
per cell. A sequencer can be added to step the receiver
through several instructions per cell. The first instruction in
such a sequence serves as the entry point.
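One way to picture the sequencer is the sketch below, which steps through a run of consecutive instructions and checks entry-point permission only on the first one. How a sequence is terminated is not stated in the text, so the explicit `count` argument, the `entry_points` set, and the `execute` callback are all hypothetical.

```python
def run_sequence(instr_memory, entry_points, start, count, execute):
    """Execute `count` consecutive instructions beginning at index `start`.

    Only the first instruction has to be a permitted entry point.
    """
    if start not in entry_points:
        raise PermissionError("sequence must begin at an entry point")
    if count < 1 or start + count > len(instr_memory):
        raise IndexError("sequence runs past the instruction memory")
    # Step through the instructions in order, collecting each result.
    return [execute(instr_memory[i]) for i in range(start, start + count)]
```

For example, with three stored instructions and index 0 as the only entry point, a cell could trigger instructions 0 and 1 in order but could not start directly at instruction 1.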

It is easy to further add conditional sequencing 
operations, subroutines, etc. However, to keep the interface 
simple, more complex functionality is best obtained by 
trapping to the host processor or by adding a co-processor, 
such as a micro-processor. 

Having now described a few embodiments of the 
invention, and some modifications and variations thereto, it
should be apparent to those skilled in the art that the
foregoing is merely illustrative and not limiting, having been
presented by way of example only. Numerous modifications
and other embodiments are within the scope of one of 
ordinary skill in the art and are contemplated as falling 
within the scope of the invention as limited only by the 
appended claims and equivalents thereto. 
What is claimed is: 

1. A communication system, comprising: 
a sender having a processor which outputs requests that
messages be sent and a network interface connected
to the processor which forms a message, in response to
a request from the processor;
a receiver comprising a processor controlled by an oper- 
ating system and connected to a network interface 
having a memory; 
wherein the message formed by the sender contains an 
operand, an indication of a desired operation and a 
reference to information in the memory in the network 
interface of the receiver; 
a network, connected between the network interface of the 
sender and the network interface of the receiver, for 
communicating the message between the sender and 
the receiver; 

wherein the network interface of the receiver includes an
input buffer for receiving the message and means, 
responsive to the message in the input buffer, for 
performing the operation indicated by the message 
according to the operand in the message and the 
information in the memory in the network interface of 
the receiver only if the operation is permitted to be 
performed by the receiver, and 
wherein the operand indicates an address in a host 
memory in the receiver and wherein the action is 
depositing the message at a location in the host memory 
in the receiver according to the address in the message
and the information in the memory of the network 
interface of the receiver. 

2. The communication system of claim 1, wherein the 
operand in the message indicates an address register for 
storing an address in the host memory of the receiver and the 
information in the memory of the network interface of the 
receiver is the address stored in the address register, the 
communication system further comprising in the receiver 

means for obtaining the address from the indicated 

address register, and 
means for storing the message in the host memory at the 
obtained address. 

3. The communication system of claim 2, wherein the 
operand further indicates an offset and the communication 
system further comprises means for updating the address in
the address register by the offset. 

4. The communication system of claim 1, wherein the 
receiver includes means for comparing the operand to the 
information in the network interface of the receiver, and
means for generating an interrupt when the means for
comparing indicates that the message requires immediate 
action. 

5. The communication system of claim 4, further com- 
prising means for updating the information in the memory of 
the network interface of the receiver.

6. The communication system of claim 4, further com- 
prising means for queueing the message when the means for 
comparing indicates that the message does not require 
immediate action. 

7. The communication system of claim 4, wherein the
operand in the message indicates an offset and an address
register storing an address in a host memory of the receiver
and the information in the memory of the network interface
of the receiver is the address stored in the address register,
the communication system further comprising, in the
receiver:

means for obtaining the address from the indicated
address register; and
means for storing the message in the host memory at the
offset from the obtained address.

8. A method for communication in a system having a
sender and a receiver interconnected by a network, wherein
the receiver has a network interface connected between the
network and a host processor and host memory controlled by
an operating system, the method comprising the steps of:

sending a message from the sender through the network to
the receiver, wherein the message includes an operand,
an indication of the operation to be performed and a
reference to information in a memory of the network
interface of the receiver;

receiving the message at the receiver;

insuring that the operation to be performed for the sender
is permitted at the receiver; and

if the action is permitted, performing, separate from the
processor and operating system, the operation at the
receiver according to the operand in the message and
the information in the memory of the network interface
of the receiver; and

wherein the operand indicates an address in a memory in
the receiver and wherein the step of performing an
operation is the step of depositing the message at a
location in the host memory in the receiver according
to the address in the message and the information in the
memory of the network interface of the receiver.

9. The method of claim 8, wherein the operand in the
message indicates an address register for storing an address
in the host memory of the receiver and the information in the
memory of the network interface of the receiver is the
address stored in the address register, the method further
comprising the steps of:

obtaining the address from the indicated address register,
and

storing the message in the host memory at the obtained
address.

10. The method of claim 9, wherein the operand further
indicates an offset and the communication system further
comprises the step of updating the address in the address
register by the offset.

11. The method of claim 8, wherein the step of performing
an operation comprises the steps of:

comparing the operand to the information in the memory
of the network interface of the receiver; and

generating an interrupt when the step of comparing indicates
that the message requires immediate action.

12. The method of claim 11, further comprising the step
of updating the information in the memory of the network 
interface of the receiver. 

13. The method of claim 11, further comprising the step 
of queueing the message when the step of comparing 
indicates that the message does not require immediate 
action. 

14. The method of claim 11, wherein the operand in the
message indicates an offset and an address register for 
storing an address in the host memory of the receiver and the 
information in the memory of the network interface of the 
receiver is the address stored in the address register, the 
communication system further comprising the steps of: 

obtaining the address from the indicated address register; 
and 

storing the message in the host memory at the offset from 
the obtained address. 

15. A communication system for a computer system 
having a sender and a receiver interconnected by a network, 
wherein the receiver has a network interface connected 
between the network and a host processor and host memory 
controlled by an operating system, the communication sys- 
tem comprising: 

means, at the sender, for sending over the network to the 
receiver a message which includes an operand, an 
indication of an operation to be performed and a 
reference to information in a memory in the network 
interface of the receiver; 

means, in the network interface of the receiver, for 
receiving the message; 

means, in the network interface and operative in response 
to receipt of the message, for determining whether the 
operation is permitted to be performed at the receiver, 
and 

means, in the network interface and operative when the 
operation is permitted, for obtaining an additional 
operand using the reference to information in the 
memory of the network interface of the receiver, and 
for performing the operation, separate from the host 
processor and operating system, using the operand in 
the message and the additional operand; and 

wherein the operand indicates an address in the host 
memory in the receiver and wherein the means for 
performing an operation includes means for depositing 
the message at a location in the host memory in the 
receiver according to the address in the message and 
the information in the memory of the network interface 
of the receiver. 

16. The communication system of claim 15, wherein the
operand in the message indicates an address register in the
network interface for storing an address in the host memory
of the receiver and the information in the memory of the
network interface of the receiver is the address stored in the
address register, the communication system further comprising:

means for obtaining the address from the indicated
address register, and
means for storing the message in the host memory at the
obtained address.

17. The communication of claim 16, wherein the operand 
further indicates an offset and the system further comprises 
the means for updating the address in the address register by 
the offset.